Johansson (2011), in "Hail the impossible: p-values, evidence, and likelihood", states that lower $p$-values are often taken as stronger evidence against the null. That is, Johansson implies that people would consider the evidence against the null to be stronger if their statistical test yielded a $p$-value of $0.01$ than if it yielded a $p$-value of $0.45$. Johansson lists four reasons why the $p$-value cannot be used as evidence against the null:
- $p$ is uniformly distributed under the null hypothesis and can therefore never indicate evidence for the null.
- $p$ is conditioned solely on the null hypothesis and is therefore unsuited to quantify evidence, because evidence is always relative in the sense of being evidence for or against a hypothesis relative to another hypothesis.
- $p$ designates probability of obtaining evidence (given the null), rather than strength of evidence.
- $p$ depends on unobserved data and subjective intentions and therefore implies, given the evidential interpretation, that the
evidential strength of observed data depends on things that did not
happen and subjective intentions.
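The first point is easy to check empirically. A minimal simulation sketch (using `numpy` and `scipy`; the sample size and number of simulations are arbitrary choices) repeatedly draws data under the null and shows that the resulting $p$-values spread evenly over $[0, 1]$ rather than piling up near $1$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 10_000, 30

# Draw samples with true mean 0 and test H0: mean = 0 (the null is true)
pvals = np.array([
    stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue
    for _ in range(n_sims)
])

# Under H0 the p-value is Uniform(0, 1): each decile holds ~10% of p-values,
# and p < 0.05 occurs ~5% of the time even though the null is true
hist, _ = np.histogram(pvals, bins=10, range=(0.0, 1.0))
print(hist / n_sims)
print((pvals < 0.05).mean())
```

Because every value of $p$ is equally probable when the null holds, no region of $p$ can indicate evidence *for* the null.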
Unfortunately, I cannot get an intuitive understanding from Johansson's article. To me, a $p$-value of $0.01$ suggests there is less chance that the null is true than a $p$-value of $0.45$ does. Why are lower $p$-values not stronger evidence against the null?
Best Answer
My personal appraisal of his arguments:
His suggestion of using the likelihood ratio as a measure of evidence is, in my opinion, a good one (though the idea of a Bayes factor is more general), but the context in which he brings it up is a bit peculiar. First, he leaves the grounds of Fisherian testing, where there is no alternative hypothesis from which to calculate a likelihood ratio; yet $p$ as evidence against the null is Fisherian, so he conflates Fisher and Neyman-Pearson. Second, most test statistics that we use are (functions of) the likelihood ratio, and in that case $p$ is a monotone transformation of the likelihood ratio. As Cosma Shalizi puts it:
Here $q(x)$ is the density under the state "signal" and $p(x)$ the density under the state "noise". The measure of "sufficiently likely" would here be $P\big(q(X)/p(X) > t_{obs} \mid H_0\big)$, which is $p$. Note that in correct Neyman-Pearson testing, $t_{obs}$ is substituted by a fixed $t(s)$ such that $P\big(q(X)/p(X) > t(s) \mid H_0\big) = \alpha$.
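A sketch of that last step, under assumed simple hypotheses of my own (noise $N(0,1)$ versus signal $N(1,1)$, chosen only for illustration): because the likelihood ratio is increasing in $x$, fixing the rejection probability under $H_0$ at $\alpha$ amounts to fixing a critical value on the $x$ scale, which then maps to a fixed likelihood-ratio threshold $t(s)$:

```python
import numpy as np
from scipy import stats

alpha = 0.05

# For N(1,1) vs N(0,1), the LR q(x)/p(x) = exp(x - 1/2) is increasing in x,
# so P(LR > t(s) | H0) = alpha is equivalent to P(X > c | H0) = alpha
c = stats.norm.isf(alpha)   # critical value on the x scale
t_s = np.exp(c - 0.5)       # the corresponding fixed LR threshold t(s)

# Check: the rejection probability under H0 is exactly alpha, regardless
# of the observed t_obs -- the threshold is set before seeing the data
print(c, t_s, stats.norm.sf(c))
```

The contrast with the Fisherian $p$ is that here the threshold $t(s)$ is fixed in advance by $\alpha$, whereas $p$ plugs in the data-dependent $t_{obs}$.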