From | "Michael N. Mitchell" <Michael.Norman.Mitchell@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: two-tailed tests |
Date | Fri, 09 Jul 2010 11:21:53 -0700 |
Dear All,

I have found this to be a very fascinating discussion. I feel that it reveals much about our science and our training as scientists. Issues have been raised that go beyond one- and two-tailed tests, issues that might be categorized as science, research methods, or even philosophy of science (with David's recent post regarding Popper). I would like to comment, but do so in a very personal way, in terms of my experience being trained as an experimental psychologist, based on methods and a philosophy of science that drew largely from books and publications of the 1960s and 1970s.
It feels to me that there is a schism between the underlying philosophy of science and the statistical methods taught and used. During my training as an experimental psychologist, the research model strongly encouraged "a priori" null and alternative hypotheses that explicitly specified the pattern of expected results. The tests usually involved categorical (factor) variables with more than two levels, and we were taught to construct planned comparisons to show the expected direction of results that would be consistent with our hypothesis (and theory) of interest. Likewise, if interactions were present, we were to plot the predicted pattern of interaction and statistically test for that exact pattern. I remember one lecture where the professor suggested that it was common practice (and should continue to be common practice) to mail oneself a sealed letter containing the hypotheses and predicted pattern of results, to show that the predictions pre-dated the data collection. These practices, I believe, were firmly rooted in the underlying philosophy of science.
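To make that style concrete in Stata terms, here is a minimal sketch of a single, pre-specified planned comparison. The auto dataset and the particular contrast are purely illustrative inventions for this sketch, not anything from my training days:

* Illustrative planned comparison using the shipped auto data. The invented
* "hypothesis" is that cars with the best 1978 repair record (rep78==5)
* average higher mpg than groups 3 and 4 combined.
sysuse auto, clear
regress mpg i.rep78
* test the one pre-specified contrast rather than scanning all pairwise comparisons
lincom 5.rep78 - (3.rep78 + 4.rep78)/2

The point of the style was that this one contrast, and its expected direction, was written down before the data were seen.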
This way of doing things seems so quaint these days. It has been years since I have seen research that uses this kind of scientific model. In practice, I see much more use of shotgun statistical tests with multiple predictors, multiple outcomes, and multiple subgroups, with a very blurred distinction between hypotheses, planned results, and p-values less than 0.05. With the richness of modern datasets in terms of observations, predictors, and outcomes, and with easy access to statistical computation, it seems like a natural progression to try to extract as much information as possible from a modern dataset using modern techniques. While these practices are understandable and practical, are they still consistent with the scientific foundations from which statistical hypothesis testing derived? Are "p-values" from such statistical analyses really "hypothesis tests"? Statistical output can contain dozens or hundreds of "p-values"... do researchers really have dozens or hundreds of clearly articulated null hypotheses? (And, let's ignore the Type I error rate issue for now.) Or is it that the hypothesis was not really conceived until a result less than 0.05 was discovered, at which point the researcher's self-deceiving, intelligent mind invents an ex post facto hypothesis that "predicts" the result?
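A toy simulation makes the worry vivid: even when every predictor is pure noise, a shotgun scan over a hundred of them will, on average, turn up about five "significant" results at the .05 level. A minimal sketch (the sample size, seed, and number of predictors are arbitrary choices for illustration):

* Simulate pure noise: 500 observations, 100 predictors unrelated to y.
clear
set seed 12345
set obs 500
generate y = rnormal()
forvalues i = 1/100 {
    generate x`i' = rnormal()
}
* Regress y on each predictor in turn and count two-tailed p-values below .05.
local hits = 0
forvalues i = 1/100 {
    quietly regress y x`i'
    local p = 2*ttail(e(df_r), abs(_b[x`i']/_se[x`i']))
    if `p' < .05 local ++hits
}
display "`hits' of 100 pure-noise predictors reach p < .05"

Any of those chance "hits" can be dressed up after the fact as a hypothesis that was "confirmed."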
Modern statistical tools no longer seem synchronized with the underlying science of hypothesis testing (as I was taught as an experimental psychologist). I feel that the usage of statistical tools was once tightly integrated with a foundation consisting of a scientific model and a philosophy of science. Over time, I feel the usage of these tools has drifted away from this original foundation, and I have not seen a new foundation that has replaced it. Instead, it feels that statistics (as practiced) is data mining cloaked in the legitimacy of scientific hypothesis testing. A traditional 1960s experimental psychologist's style of hypothesis testing is not suited to today, and the data-mining style of statistical analysis does not seem suited to the foundations of hypothesis testing as taught in the 1960s.
Instead of practicing statistics as a form of data mining and pretending that we are testing hypotheses in a planned fashion, perhaps we need a scientific model that is still philosophically and scientifically justified and that supports the ability to data mine. Then researchers could candidly describe what they do in the context of good scientific practices.
Thanks for your patience if you have gotten this far. Please remember that although I make some general statements here, they are all rooted in and reflect my personal experiences.
Best regards,

Michael N. Mitchell
Data Management Using Stata - http://www.stata.com/bookstore/dmus.html
A Visual Guide to Stata Graphics - http://www.stata.com/bookstore/vgsg.html
Stata tidbit of the week - http://www.MichaelNormanMitchell.com

On 2010-07-09 8.52 AM, David Bell wrote:
Statistical tests perform many functions.

When one-tailed is theoretically/philosophically justified: In a Popperian theory-testing world, any result other than positive significance means the theory is disconfirmed. So in this case, a one-tailed test is exactly appropriate. (Popper, Karl R. 1965. Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Harper & Row.)

When one-tailed is NOT theoretically/philosophically justified: A two-tailed test implies meaning in both tails. In applied studies that test outcomes, a positive result is taken to mean the drug works, non-significance is taken to mean that the drug does not work (actually, that there is no evidence that the drug works), but a significant negative result means that the drug is actively harmful. In this case, a one-tailed test would have obscured the harmfulness of the drug by lumping harm with non-effectiveness.

Remember that the purpose of a statistical test, for the scientific community as a sociological community, is to protect the community from the enthusiasm of researchers. If a researcher is going to tell us his/her theory is true, we want the chance of being fooled (Type I error, false positive) to be strictly limited. We as readers and consumers don't worry about the disappointment of a researcher with a good theory but bad data (Type II error, false negative).

As a practical matter, many journals insist on two-tailed tests for several reasons. One reason is that two-tailed tests are conservative (even though the stated probability level is inaccurate, it means that the researcher has only half the chance to fool us). Another is to discourage researchers from "cherry-picking" close calls, e.g., reporting one-tailed .05 significance instead of two-tailed .10 ("marginal") significance. Another is that, in the endeavor of science, one result is relatively unimportant, so conservative is better for the overall process of science.

Dave

====================================
David C. Bell
Professor of Sociology
Indiana University Purdue University Indianapolis (IUPUI)
(317) 278-1336
====================================

On Jul 8, 2010, at 10:10 PM, Eric Uslaner wrote:

... If you found an extreme result in the wrong direction, you would be better advised to check your data for errors or your model for very high levels of multicollinearity. If someone found that strong Republican party identifiers are much more likely than strong Democrats to vote for the Democratic candidate, no one would give that finding any credibility no matter what a two-tailed test showed. The same would hold for a model in economics that showed a strong negative relationship between investment in education and economic growth. Of course, those who put such faith in two-tailed tests would say: you never know. Well, you do. That's the role of theory.

Now I don't know what goes on substantively (or methodologically) in the biological sciences, for example. It seems as if many people are very much concerned with the null hypothesis. In the social sciences, we learn that the null hypothesis is generally uninteresting. When it is interesting, as in my own work on democracy and corruption, it is to debunk the argument that democracy leads to less corruption (with the notion that democracy might lead to more corruption seen as not worth entertaining seriously). So again, one would use a one-tailed test and expect that there would be no positive relation between democratization and lack of corruption.
Of course, Nick is right that graphics often tell a much better story. But that is not the issue here. Two-tailed tests are largely an admission that you are going fishing. They are the statistical equivalent of stepwise regression (http://www.rand.org/pubs/papers/P4260/).

Ric Uslaner
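As a small footnote to Dave's point about one-tailed .05 versus two-tailed .10: Stata's estimation output reports two-tailed p-values, and the corresponding one-tailed p-value for a directional hypothesis can be recovered from the t statistic. A minimal sketch, where the auto data and the assumed positive direction are only illustrative:

sysuse auto, clear
regress price weight
* t statistic for the weight coefficient
local t = _b[weight]/_se[weight]
* two-tailed p-value, as shown in the regression table
display "two-tailed p = " %6.4f 2*ttail(e(df_r), abs(`t'))
* one-tailed p-value for the directional hypothesis that the coefficient is positive
display "one-tailed p  = " %6.4f ttail(e(df_r), `t')

When the estimate lies in the hypothesized direction, the one-tailed p is simply half the reported two-tailed p, which is exactly the "close call" temptation Dave describes.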
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/