Johansson (2011), in "Hail the impossible: p-values, evidence, and likelihood", states that lower $p$-values are often taken as stronger evidence against the null. That is, Johansson implies that people would consider the evidence against the null to be stronger if their statistical test yielded a $p$-value of $0.01$ than if it yielded a $p$-value of $0.45$. Johansson lists four reasons why the $p$-value cannot be used as evidence against the null:
- $p$ is uniformly distributed under the null hypothesis and can therefore never indicate evidence for the null.
- $p$ is conditioned solely on the null hypothesis and is therefore unsuited to quantify evidence, because evidence is always relative in the sense of being evidence for or against a hypothesis relative to another hypothesis.
- $p$ designates probability of obtaining evidence (given the null), rather than strength of evidence.
- $p$ depends on unobserved data and subjective intentions and therefore implies, given the evidential interpretation, that the
evidential strength of observed data depends on things that did not
happen and subjective intentions.
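The first point is easy to check empirically. A minimal simulation sketch (using `numpy` and `scipy`; the sample size and number of simulations are arbitrary choices) repeatedly draws data under the null and shows that the resulting $p$-values spread evenly over $[0, 1]$ rather than piling up near $1$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 10_000, 30

# Draw samples with true mean 0 and test H0: mean = 0 (the null is true)
pvals = np.array([
    stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue
    for _ in range(n_sims)
])

# Under H0 the p-value is Uniform(0, 1): each decile holds ~10% of p-values,
# and p < 0.05 occurs ~5% of the time even though the null is true
hist, _ = np.histogram(pvals, bins=10, range=(0.0, 1.0))
print(hist / n_sims)
print((pvals < 0.05).mean())
```

Because every value of $p$ is equally probable when the null holds, no region of $p$ can indicate evidence *for* the null.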
Unfortunately, I cannot get an intuitive understanding from Johansson's article. To me, a $p$-value of $0.01$ suggests there is less chance that the null is true than a $p$-value of $0.45$ does. Why are lower $p$-values not stronger evidence against the null?
Best Answer
My personal appraisal of his arguments:
His suggestion of using the likelihood ratio as a measure of evidence is, in my opinion, a good one (though the idea of a Bayes factor is more general), but the context in which he brings it up is a bit peculiar. First, he leaves the grounds of Fisherian testing, where there is no alternative hypothesis from which to calculate a likelihood ratio; yet $p$ as evidence against the null is Fisherian, so he conflates Fisher and Neyman-Pearson. Second, most test statistics that we use are (functions of) the likelihood ratio, and in that case $p$ is a monotone transformation of the likelihood ratio. As Cosma Shalizi puts it:
Here $q(x)$ is the density under the state "signal" and $p(x)$ the density under the state "noise". The measure of "sufficiently likely" would here be $P\big(q(X)/p(X) > t_{obs} \mid H_0\big)$, which is $p$. Note that in correct Neyman-Pearson testing, $t_{obs}$ is substituted by a fixed $t(s)$ such that $P\big(q(X)/p(X) > t(s) \mid H_0\big) = \alpha$.
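A sketch of that last step, under assumed simple hypotheses of my own (noise $N(0,1)$ versus signal $N(1,1)$, chosen only for illustration): because the likelihood ratio is increasing in $x$, fixing the rejection probability under $H_0$ at $\alpha$ amounts to fixing a critical value on the $x$ scale, which then maps to a fixed likelihood-ratio threshold $t(s)$:

```python
import numpy as np
from scipy import stats

alpha = 0.05

# For N(1,1) vs N(0,1), the LR q(x)/p(x) = exp(x - 1/2) is increasing in x,
# so P(LR > t(s) | H0) = alpha is equivalent to P(X > c | H0) = alpha
c = stats.norm.isf(alpha)   # critical value on the x scale
t_s = np.exp(c - 0.5)       # the corresponding fixed LR threshold t(s)

# Check: the rejection probability under H0 is exactly alpha, regardless
# of the observed t_obs -- the threshold is set before seeing the data
print(c, t_s, stats.norm.sf(c))
```

The contrast with the Fisherian $p$ is that here the threshold $t(s)$ is fixed in advance by $\alpha$, whereas $p$ plugs in the data-dependent $t_{obs}$.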