Solved – Why are lower p-values not more evidence against the null? Arguments from Johansson 2011

hypothesis testing, p-value, philosophical, statistical significance

Johansson (2011), in "Hail the impossible: p-values, evidence, and likelihood" (here is also a link to the journal), states that lower $p$-values are often taken as stronger evidence against the null. Johansson implies that people would consider the evidence against the null to be stronger if their statistical test returned a $p$-value of $0.01$ than if it returned a $p$-value of $0.45$. Johansson lists four reasons why the $p$-value cannot be used as evidence against the null:

  1. $p$ is uniformly distributed under the null hypothesis and can therefore never indicate evidence for the null.
  2. $p$ is conditioned solely on the null hypothesis and is therefore unsuited to quantify evidence, because evidence is always
    relative in the sense of being evidence for or against a hypothesis
    relative to another hypothesis.
  3. $p$ designates probability of obtaining evidence (given the null), rather than strength of evidence.
  4. $p$ depends on unobserved data and subjective intentions and therefore implies, given the evidential interpretation, that the
    evidential strength of observed data depends on things that did not
    happen and subjective intentions.
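Point 1 can be checked directly by simulation. The following sketch (my own illustration, not from Johansson's paper) draws many samples for which the null is exactly true and shows that the resulting $p$-values are uniformly distributed, so a small $p$ is exactly as probable as a large one under $H_0$:

```python
# Sketch (my own illustration): under a true null, p-values are
# uniformly distributed on [0, 1], so a small p is exactly as likely
# as a large one when H0 holds.
import math
import random

random.seed(0)

def z_test_pvalue(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test p-value for H0: mean = mu0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # Normal CDF via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 10,000 experiments in which H0 (mean = 0) is exactly true
pvals = [z_test_pvalue([random.gauss(0, 1) for _ in range(30)])
         for _ in range(10_000)]

# Each decile of [0, 1] catches roughly 10% of the p-values
deciles = [sum(1 for p in pvals if i / 10 <= p < (i + 1) / 10) / len(pvals)
           for i in range(10)]
print(deciles)
```

Because the distribution is flat under $H_0$, observing $p = 0.9$ is no more "evidence for the null" than observing $p = 0.1$.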

Unfortunately, I cannot get an intuitive understanding from Johansson's article. To me, a $p$-value of $0.01$ indicates there is less chance that the null is true than a $p$-value of $0.45$ does. Why are lower $p$-values not stronger evidence against the null?

Best Answer

My personal appraisal of his arguments:

  1. Here he talks about using $p$ as evidence for the Null, whereas his thesis is that $p$ can't be used as evidence against the Null. So, I think this argument is largely irrelevant.
  2. I think this is a misunderstanding. Fisherian $p$-value testing is strongly rooted in Popper's Critical Rationalism, which holds that you cannot support a theory but only criticize it. So in that sense there is only a single hypothesis (the Null), and you simply check whether your data are in accordance with it.
  3. I disagree here. It depends on the test statistic, but $p$ is usually a transformation of an effect size that speaks against the Null. So the higher the effect, the lower the $p$-value---all other things equal. Of course, across different data sets or hypotheses this is no longer valid.
  4. I am not sure I completely understand this statement, but from what I can gather this is less a problem of $p$ than of people using it wrongly. $p$ was intended to have the long-run frequency interpretation, and that is a feature, not a bug. But you can't blame $p$ for people taking a single $p$-value as proof of their hypothesis, or for people publishing only results with $p < .05$.
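The claim in point 3 above---that $p$ is, all other things equal, a monotone transformation of the effect size, but that the mapping shifts between data sets---can be made concrete. This sketch (my own example, using a simple two-sided z-test) shows both halves of that claim:

```python
# Sketch illustrating point 3 (my own example): for a fixed two-sided
# z-test, p is a monotone decreasing function of the observed effect,
# but the mapping depends on the sample size, so the same effect gives
# different p-values across data sets.
import math

def z_test_pvalue(effect, n, sigma=1.0):
    """Two-sided z-test p-value for an observed mean shift `effect`."""
    z = effect / (sigma / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# All other things equal (n = 30): the bigger the effect, the smaller p
for effect in (0.1, 0.3, 0.5):
    print(f"effect={effect:.1f}  p={z_test_pvalue(effect, n=30):.4f}")

# Same effect, different n: p changes, so p alone does not measure the
# strength of the effect across data sets
for n in (10, 30, 100):
    print(f"n={n:3d}  p={z_test_pvalue(0.3, n):.4f}")
```

The second loop is exactly why "for different data sets or hypotheses this is no longer valid": the same observed effect of $0.3$ produces very different $p$-values depending on $n$.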

His suggestion of using the likelihood ratio as a measure of evidence is in my opinion a good one (though the idea of a Bayes factor is more general), but the context in which he brings it up is a bit peculiar: First, he leaves the grounds of Fisherian testing, where there is no alternative hypothesis from which to calculate a likelihood ratio. But $p$ as evidence against the Null is Fisherian. Hence he conflates Fisher and Neyman-Pearson. Second, most test statistics that we use are (functions of) the likelihood ratio, and in that case $p$ is a transformation of the likelihood ratio. As Cosma Shalizi puts it:

> among all tests of a given size $s$, the one with the smallest miss probability, or highest power, has the form "say 'signal' if $q(x)/p(x) > t(s)$, otherwise say 'noise'," and that the threshold $t$ varies inversely with $s$. The quantity $q(x)/p(x)$ is the likelihood ratio; the Neyman-Pearson lemma says that to maximize power, we should say "signal" if it is sufficiently more likely than noise.

Here $q(x)$ is the density under state "signal" and $p(x)$ the density under state "noise". The measure for "sufficiently likely" would here be $P(q(X)/p(X) > t_{obs} \mid H_0)$, which is $p$. Note that in correct Neyman-Pearson testing, $t_{obs}$ is replaced by a fixed $t(s)$ such that $P(q(X)/p(X) > t(s) \mid H_0) = \alpha$.
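The relationship can be sketched for the simplest case (assumptions mine): simple hypotheses $H_0\!: X \sim N(0,1)$ versus $H_1\!: X \sim N(1,1)$. Here the likelihood ratio $q(x)/p(x) = \exp(x - 1/2)$ is monotone increasing in $x$, so $P(q(X)/p(X) > t_{obs} \mid H_0) = P(X > x_{obs} \mid H_0)$, and $p$ is a monotone decreasing transformation of the observed likelihood ratio:

```python
# Sketch (assumptions mine): simple hypotheses H0: N(0,1) vs H1: N(1,1).
# The likelihood ratio q(x)/p(x) = exp(x - 1/2) increases in x, so the
# one-sided p-value P(X > x_obs | H0) decreases monotonically in the
# observed likelihood ratio.
import math

def likelihood_ratio(x, mu0=0.0, mu1=1.0):
    """q(x)/p(x) for unit-variance normal densities under mu1 vs mu0."""
    return math.exp((mu1 - mu0) * x - (mu1**2 - mu0**2) / 2)

def p_value(x, mu0=0.0):
    """One-sided p-value P(X > x | H0) for X ~ N(mu0, 1)."""
    return 1 - 0.5 * (1 + math.erf((x - mu0) / math.sqrt(2)))

for x in (0.5, 1.0, 2.0):
    print(f"x={x:.1f}  LR={likelihood_ratio(x):7.3f}  p={p_value(x):.4f}")
```

A larger observed likelihood ratio always goes with a smaller $p$, which is the sense in which $p$ is "a transformation of the likelihood ratio" here.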
