Interpretation of the p-Value in Hypothesis Testing

Tags: hypothesis-testing, p-value

I recently came across the paper "The Insignificance of Null Hypothesis Significance Testing" by Jeff Gill (1999). The author discusses several common misconceptions regarding hypothesis testing and p-values, about which I have two specific questions:

  1. The p-value is technically $P({\rm observation}|H_{0})$, which, as pointed out by the paper, generally does not tell us anything about $P(H_{0}|{\rm observation})$, unless we happen to know the marginal distributions, which is rarely the case in "everyday" hypothesis testing. When we obtain a small p-value and "reject the null hypothesis," what exactly is the probabilistic statement that we are making, since we cannot say anything about $P(H_{0}|{\rm observation})$?
  2. The second question relates to a particular statement on page 6 (p. 652 in the journal pagination) of the paper:

Since the p-value, or range of p-values indicated by stars, is not set a priori, it is not the long-run probability of making a Type I error but is typically treated as such.

Can anyone help explain what is meant by this statement?

Best Answer

(Technically, the P-value is the probability of observing data at least as extreme as that actually observed, given the null hypothesis.)
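As a concrete illustration of "at least as extreme," here is a minimal Python sketch of that definition for a one-sided coin-flip test. The numbers (100 flips, 60 heads) are my own illustration, not from Gill or from the question:

```python
# Minimal sketch: a one-sided binomial p-value computed directly from the
# definition "P(data at least as extreme as observed | H0)".
# The coin-flip numbers are illustrative assumptions, not from the paper.
from scipy.stats import binom

n, observed_heads = 100, 60   # observed: 60 heads in 100 flips
p_null = 0.5                  # H0: the coin is fair

# P(X >= 60 | H0), i.e. the upper tail of the null binomial distribution
p_value = binom.sf(observed_heads - 1, n, p_null)
print(f"one-sided p-value: {p_value:.4f}")  # about 0.028
```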

Q1. A decision to reject the null hypothesis on the basis of a small P-value typically depends on 'Fisher's disjunction': either a rare event has happened or the null hypothesis is false. In effect, it is the rarity of the event that the P-value tells you about, rather than the probability that the null is false.

The probability that the null is false can be obtained from the experimental data only by way of Bayes' theorem, which requires specification of the 'prior' probability of the null hypothesis (presumably what Gill is referring to as "marginal distributions").
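To see how much the prior matters, here is a hypothetical simulation, my own construction rather than anything in Gill's paper: suppose half of all tested null hypotheses are true, and ask what fraction of results with p < 0.05 nevertheless come from true nulls. The prior, effect size, and sample size below are all assumed for illustration:

```python
# Hypothetical simulation (not from Gill): even among results with
# p < 0.05, P(H0 | rejection) depends on the prior P(H0) and on power,
# and it need not be anywhere near 0.05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, n_per_group, effect = 20_000, 20, 0.5  # assumed values
prior_h0 = 0.5                                 # assumed prior P(H0)

h0_true = rng.random(n_sims) < prior_h0
rejected = rejected_and_h0 = 0
for h0 in h0_true:
    mu = 0.0 if h0 else effect                 # no effect when H0 is true
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(mu, 1.0, n_per_group)
    if ttest_ind(a, b).pvalue < 0.05:
        rejected += 1
        rejected_and_h0 += h0

# Roughly 0.13 under these assumptions: a "significant" result comes from
# a true null far more often than the p-value threshold might suggest.
print(f"P(H0 | p < 0.05) ~ {rejected_and_h0 / rejected:.2f}")
```

With these assumed numbers, about one in eight rejections is of a true null: the posterior probability of the null depends on the prior and the power, neither of which the p-value contains.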

Q2. This part of your question is much harder than it might seem. There is a great deal of confusion regarding P-values and error rates, which is presumably what Gill is referring to with "but is typically treated as such." The combination of Fisherian P-values with Neyman-Pearsonian error rates has been called an incoherent mishmash, and it is unfortunately very widespread. No short answer is going to be completely adequate here, but I can point you to a couple of good papers (yes, one is mine). Both will help you make sense of the Gill paper, and a small simulation illustrating the distinction follows the references.

Hurlbert, S., & Lombardi, C. (2009). Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46(5), 311–349.

Lew, M. J. (2012). Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don't know P. British Journal of Pharmacology, 166(5), 1559–1567. doi:10.1111/j.1476-5381.2012.01931.x
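As a supplement to those references, here is a sketch of the distinction Gill draws, again my own illustration: a significance level fixed before seeing the data does have a long-run Type I error interpretation, but the observed p-value itself is just a draw from a uniform distribution when the null is true.

```python
# Sketch (my own illustration): under a true H0, the rule "reject if
# p <= alpha", with alpha fixed a priori, errs at the long-run rate alpha;
# the realized p-values themselves are uniform on (0, 1), so a particular
# observed p (or a star range chosen after the fact) is not that rate.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
alpha, n_sims, n = 0.05, 20_000, 30  # alpha chosen before the data

pvals = np.array([
    ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue  # H0 is true
    for _ in range(n_sims)
])

print(f"long-run Type I error of the alpha rule: {(pvals <= alpha).mean():.3f}")  # ~0.050
print(f"mean observed p-value under H0: {pvals.mean():.2f}")                      # ~0.50
```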
