Solved – References containing arguments against null hypothesis significance testing

hypothesis testing, p-value, references, statistical significance

In the last few years I've read a number of papers arguing against the use of null hypothesis significance testing in science, but didn't think to keep a persistent list. A colleague recently asked me for such a list, so I thought I'd ask everyone here to help build it. To start things off, here's what I have so far:

Best Answer

Chris Fraley has taught a whole course on the history of the debate (the link seems to be broken, even though it's still on his official site; here is a copy in the Internet Archive). His summary/conclusion is here (again, an archived copy). According to Fraley's homepage, he last taught this course in 2003.

He prefaces the reading list with a statement of "Instructor's bias":

Although my goal is to facilitate lively, deep, and fair discussions on the issues at hand, I believe that it is necessary to make my bias explicit from the outset. Paul Meehl once stated that "Sir Ronald [Fisher] has befuddled us, mesmerized us, and led us down the primrose path. I believe that the almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology." I echo Meehl's sentiment. One of my goals in this seminar is to make it clear why I believe this to be the case. Furthermore, I expect you, by the time you have completed this seminar, to be able to articulate and defend your stance on the NHST debate, regardless of what that stance is.

I'll copy in the reading list in case the course page ever disappears:

Week 1. Introduction: What is a Null Hypothesis Significance Test? Facts, Myths, and the State of Our Science

  • Lykken, D. T. (1991). What’s wrong with psychology anyway? In D. Cicchetti & W. M. Grove (Eds.), Thinking Clearly about Psychology, vol. 1: Matters of Public Interest, Essays in honor of Paul E. Meehl (pp. 3-39). Minneapolis, MN: University of Minnesota Press.

Week 2. Early Criticisms of NHST

  • Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.

  • Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.

  • Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.

  • Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423-437. [optional]

Week 3. Contemporary Criticisms of NHST

  • Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

  • Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale, NJ: Lawrence Erlbaum Associates.

  • Schmidt, F. L. & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In Lisa A. Harlow, Stanley A. Mulaik, and James H. Steiger (Eds.) What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Lawrence Erlbaum Associates.

  • Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley. (Chapter 2 [A Critique of Significance Tests]) [optional]

Week 4. Rebuttal: Advocates of NHST Come to Its Defense

  • Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379-390.

  • Hagen, R. L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52, 15-24.

  • Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

  • Wainer, H. (1999). One cheer for null hypothesis significance testing. Psychological Methods, 4, 212-213.

  • Mulaik, S. A., Raju, N. S., & Harshman, R. A. (1997). There is a time and place for significance testing. In Lisa A. Harlow, Stanley A. Mulaik, and James H. Steiger (Eds.) What if there were no significance tests? (pp. 65-116). Mahwah, NJ: Lawrence Erlbaum Associates. [optional]

Week 5. Rebuttal: Advocates of NHST Come to Its Defense

  • Abelson, R. P. (1997). On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science, 8, 12-15.

  • Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16-26.

  • Scarr, S. (1997). Rules of evidence: A larger context for the statistical debate. Psychological Science, 8, 16-17.

  • Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33, 175-183.

  • Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241-301. [optional]

  • Harris, R. J. (1997). Significance tests have their place. Psychological Science, 8, 8-11. [optional]

Week 6. Effect Size

  • Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage. [Ch. 2, Defining Research Results]

  • Chow, S. L. (1988). Significance test or effect size? Psychological Bulletin, 103, 105-110.

  • Abelson, R. P. (1985). A variance explanation paradox: When a little is a lot. Psychological Bulletin, 97, 129-133. [optional]

Week 7. Statistical Power

  • Hallahan, M., & Rosenthal, R. (1996). Statistical power: Concepts, procedures, and applications. Behaviour Research and Therapy, 34, 489-499.

  • Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.

  • Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153. [optional]

  • Maddock, J. E., & Rossi, J. S. (2001). Statistical power of articles published in three health-psychology related journals. Health Psychology, 20, 76-78. [optional]

  • Thomas, L. & Juanes, F. (1996). The importance of statistical power analysis: An example from Animal Behaviour. Animal Behaviour, 52, 856-859. [optional]

  • Rossi, J. S. (1990). Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology, 58, 646-656. [optional]

  • Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83-91. [optional]

Week 8. Confidence Intervals and Significance Testing

  • Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than P values: Estimation rather than hypothesis testing. British Medical Journal, 292, 746-750.

  • Cumming, G., & Finch, S. (2001). A primer on understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 532-574.

  • Loftus, G. R., & Masson, M.E.J. (1994). Using confidence intervals in within-subject designs. Psychonomic Bulletin and Review, 1, 476-490.

Week 9 [note: we are skipping this section]. Theoretical Modeling: Developing Formal Models of Natural Phenomena

  • Haefner, J. W. (1996). Modeling biological systems: Principles and applications. New York: International Thomson Publishing. (Chapters 1 [Models of Systems] & 2 [The Modeling Process])

  • Loehlin, J. C. (1992). Latent variable models: An introduction to factor, path, and structural analysis. Hillsdale, NJ: Lawrence Erlbaum Associates. (Chapter 1 [Path models in factor, path and structural analysis], pp. 1-18)

  • Grant, D. A. (1962). Testing the null hypothesis and the strategy of investigating theoretical models. Psychological Review, 69, 54-61. [optional]

  • Binder, A. (1963). Further considerations on testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review, 70, 107-115. [optional]

  • Edwards, W. (1965). Tactical note on the relations between scientific and statistical hypotheses. Psychological Bulletin, 63, 400-402. [optional]

Week 10. What is the Meaning of Probability? Controversy Concerning Relative Frequency and Subjective Probability

  • Salsburg, D. (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century. New York: W. H. Freeman. (Chapters 10, 11, & 12)

  • Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley. (Chapters 4, 5, & 6)

  • Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In Lisa A. Harlow, Stanley A. Mulaik, and James H. Steiger (Eds.) What if there were no significance tests? (pp. 287-318). Mahwah, NJ: Lawrence Erlbaum Associates.

  • Rindskopf, D. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In Lisa A. Harlow, Stanley A. Mulaik, and James H. Steiger (Eds.) What if there were no significance tests? (pp. 319-332). Mahwah, NJ: Lawrence Erlbaum Associates.

  • Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193-242. [optional]

Week 11. Theory Appraisal: Philosophy of Science and the Testing and Amending of Theories

  • Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1, 108-141.

  • Roberts, S. & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107, 358-367.

Week 12. Theory Appraisal: Philosophy of Science and the Testing and Amending of Theories

  • Urbach, P. (1974). Progress and degeneration in the "IQ debate" (I). British Journal for the Philosophy of Science, 25, 99-125.

  • Serlin, R. C. & Lapsley, D. K. (1985). Rationality in psychological research: The good-enough principle. American Psychologist, 40, 73-83.

  • Dar, R. (1987). Another look at Meehl, Lakatos, and the scientific practices of psychologists. American Psychologist, 42, 145-151.

  • Gholson, B. & Barker, P. (1985). Kuhn, Lakatos, & Laudan: Applications in the history of physics and psychology. American Psychologist, 40, 755-769. [optional]

  • Faust, D., & Meehl, P. E. (1992). Using scientific methods to resolve questions in the history and philosophy of science: Some illustrations. Behavior Therapy, 23, 195-211. [optional]

  • Urbach, P. (1974). Progress and degeneration in the "IQ debate" (II). British Journal for the Philosophy of Science, 25, 235-259. [optional]

  • Salmon, W. C. (1973, May). Confirmation. Scientific American, 228, 75-83. [optional]

  • Meehl, P. E. (1993). Philosophy of science: Help or hindrance? Psychological Reports, 72, 707-733. [optional]

  • Manicas, P. T., & Secord, P. F. (1983). Implications for psychology of the new philosophy of science. American Psychologist, 38, 399-413. [optional]

Week 13. Has the NHST Tradition Undermined a Non-Biased, Cumulative Knowledge Base in Psychology?

  • Cooper, H., DeNeve, K., & Charlton, K. (1997). Finding the missing science: The fate of studies submitted for review by a human subjects committee. Psychological Methods, 2, 447-452.

  • Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115-129.

  • Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1-20.

  • Berger, J. O. & Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 76, 159-165.

Week 14. Replication and Scientific Integrity

  • Smith, N. C. (1970). Replication studies: A neglected aspect of psychological research. American Psychologist, 25, 970-975.

  • Sohn, D. (1998). Statistical significance and replicability: Why the former does not presage the latter. Theory and Psychology, 8, 291-311.

  • Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195-244.

  • Platt, J. R. (1964). Strong Inference. Science, 146, 347-353.

  • Feynman, R. P. (1997). Surely you’re joking, Mr. Feynman! New York: W. W. Norton. (Chapter: Cargo-cult science).

  • Rorer, L. G. (1991). Some myths of science in psychology. In D. Cicchetti & W. M. Grove (Eds.), Thinking Clearly about Psychology, vol. 1: Matters of Public Interest, Essays in honor of Paul E. Meehl (pp. 61-87). Minneapolis, MN: University of Minnesota Press. [optional]

  • Lindsay, R. M. & Ehrenberg, A. S. C. (1993). The design of replicated studies. The American Statistician, 47, 217-228. [optional]

Week 15. Quantitative Thinking: Why We Need Mathematics (and not NHST per se) in Psychological Science

  • Aiken, L. S., West, S. G., Sechrest, L., & Reno, R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: A survey of Ph.D. programs in North America. American Psychologist, 45, 721-734.

  • Meehl, P. E. (1998, May). The power of quantitative thinking. Invited address as recipient of the James McKeen Cattell Award at the annual meeting of the American Psychological Society, Washington, DC.
