False Positives – Why Are p-Values Between 0.05 and 0.95 Considered False Positives?

hypothesis testing, p-value

Edit: The basis of my question is flawed, and I need to spend some time figuring out whether it can even be made to make sense.

Edit 2: Clarifying that I recognize that a p-value isn't a direct measure of the probability of a null hypothesis, but that I assume that the closer a p-value is to 1, the more likely it is that a hypothesis has been chosen for experimental testing whose corresponding null hypothesis is true, while the closer a p-value is to 0, the more likely it is that a hypothesis has been chosen for experimental testing whose corresponding null hypothesis is false. I can't see how this is false unless the set of all hypotheses (or all hypotheses picked for experiments) is somehow pathological.

Edit 3: I think I'm still not using clear terminology to ask my question. As lottery numbers are read out, and you match them to your ticket one-by-one, something changes. The probability that you have won does not change, but the probability that you can turn the radio off does. There's a similar change that happens when experiments are done, but I have a feeling that the terminology I'm using – "p-values change the likelihood that a true hypothesis has been chosen" – isn't the correct terminology.

Edit 4: I've received two amazingly detailed and informative answers that contain a wealth of information for me to work through. I'll vote them both up now and then come back to accept one when I've learned enough from both answers to know that they've either answered or invalidated my question. This question opened a much bigger can of worms than the one I was expecting to eat.

In papers I've read, I've seen results with p > 0.05 after validation called "false positives". However, isn't it still more likely than not that I've chosen a hypothesis to test whose corresponding null hypothesis is false when the experimental data give a p-value above 0.05 but below 0.50? And aren't both the null hypothesis and the research hypothesis statistically uncertain/insignificant (given the conventional statistical significance cutoff) anywhere between 0.05 and 0.95, or rather between 0.05 and whatever the inverse of p < 0.05 is, given the asymmetry pointed out in @NickStauner's link?

Let's call that number A, and define it as the p-value which says the same thing about the likelihood that you've picked a true null hypothesis for your experiment/analysis that a p-value of 0.05 says about the likelihood that you've picked a true non-null hypothesis for your experiment/analysis. Doesn't 0.05 < p < A just say, "Your sample size wasn't big enough to answer the question, and you won't be able to judge application/real-world significance until you get a bigger sample and get your statistical significance sorted out"?

In other words, shouldn't it be correct to call a result definitely false (rather than simply unsupported) if and only if p > A?

This seems straightforward to me, but such widespread usage tells me that I might be wrong. Am I:

a) misinterpreting the mathematics,
b) complaining about a harmless-if-not-exactly-correct convention,
c) completely correct, or
d) other?

I recognize that this sounds like a call for opinions, but this seems like a question with a definite mathematically correct answer (once a significance cutoff is set) that either I or (almost) everybody else is getting wrong.

Best Answer

Your question is based on a false premise:

isn't the null hypothesis still more likely than not to be wrong when p < 0.50

A p-value is not the probability that the null hypothesis is true. For example, if you took a thousand cases where the null hypothesis is true, half of them would have p < .5, and every one of those cases would still be null.

Indeed, the idea that p > .95 means that the null hypothesis is "probably true" is equally misleading. If the null hypothesis is true, the probability that p > .95 is exactly the same as the probability that p < .05.

ETA: Your edit makes it clearer what the issue is: you still do have the issue above (that you're treating a p-value as a posterior probability, when it is not). It's important to note that this is not a subtle philosophical distinction (as I think you're implying with your discussion of the lottery tickets): it has enormous practical implications for any interpretation of p-values.

But there is a transformation you can perform on p-values that will get you to what you're looking for, and it's called the local false discovery rate. (As described by this nice paper, it's the frequentist equivalent of the "posterior error probability", so think of it that way if you like).

Let's work with a concrete example. Let's say you are performing a t-test to determine whether a sample of 10 numbers (from a normal distribution) has a mean of 0 (a one-sample, two-sided t-test). First, let's see what the p-value distribution looks like when the mean actually is zero, with a short R simulation:

# simulate 10,000 p-values from one-sample t-tests where the null (true mean = 0) holds
null.pvals = replicate(10000, t.test(rnorm(10, mean=0, sd=1))$p.value)
hist(null.pvals)

[Figure: histogram of null.pvals, approximately uniform between 0 and 1]

As we can see, null p-values have a uniform distribution (equally likely at all points between 0 and 1). This is a necessary condition of p-values: indeed, it's precisely what p-values mean! (Given the null is true, there is a 5% chance it is less than .05, a 10% chance it is less than .1...)
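If you want to check those numbers on the simulation itself, here is a quick sanity check (this assumes the null.pvals vector from the code above is still in your workspace; the exact values will wobble a bit from run to run):

mean(null.pvals < .05)   # should come out close to .05
mean(null.pvals < .10)   # close to .10
mean(null.pvals > .95)   # also close to .05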

Now let's consider the alternative hypothesis: cases where the null is false. This is a bit more complicated: when the null is false, "how false" is it? The true mean isn't 0, but is it .5? 1? 10? Does it randomly vary, sometimes small and sometimes large? For simplicity's sake, let's say it is always equal to .5 (but remember that complication; it'll be important later):

# same simulation, but the true mean is .5, so the null is false
alt.pvals = replicate(10000, t.test(rnorm(10, mean=.5, sd=1))$p.value)
hist(alt.pvals)

[Figure: histogram of alt.pvals, piled up towards 0]

Notice that the distribution is now not uniform: it is shifted towards 0! In your comment you mention an "asymmetry" that gives information: this is that asymmetry.
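You can put a number on that asymmetry by comparing how often each simulation falls below the usual cutoff (again, just a rough check on the two simulated vectors above):

mean(null.pvals < .05)   # about .05 by construction
mean(alt.pvals < .05)    # much larger: the test's power against a true mean of .5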

So imagine you knew both of those distributions, but you're working with a new experiment, and you also have a prior that there's a 50% chance it's null and 50% that it's alternative. You get a p-value of .7. How can you get from those distributions, that prior, and the p-value to a probability that the null is true?

What you should do is compare densities:

# plot the null density first, then overlay the alternative density
plot(density(null.pvals, bw=.02))
lines(density(alt.pvals, bw=.02))

And look at your p-value:

abline(v=.7, col="red", lty=2)   # mark the observed p-value

[Figure: null and alternative p-value densities overlaid, with a dashed red line at p = .7]

That ratio between the null density and the alternative density can be used to calculate the local false discovery rate: the higher the null is relative to the alternative, the higher the local FDR. That's the probability that the hypothesis is null (technically it has a stricter frequentist interpretation, but we'll keep it simple here). If that value is very high, then you can make the interpretation "the null hypothesis is almost certainly true." Indeed, you can make a .05 and .95 threshold of the local FDR: this would have the properties you're looking for. (And since local FDR increases monotonically with p-value, at least if you're doing it right, these will translate to some thresholds A and B where you can say "between A and B we are unsure").
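To make that concrete, here is a minimal sketch of the density-ratio calculation at p = .7, using the null.pvals and alt.pvals vectors simulated above and the 50/50 prior from the example (the object names and the use of plain kernel density estimates are mine; a real local FDR estimator would be more careful, especially near the boundaries):

# kernel density estimates of the two p-value distributions on [0, 1]
null.dens = density(null.pvals, bw=.02, from=0, to=1)
alt.dens  = density(alt.pvals,  bw=.02, from=0, to=1)

# heights of each density at the observed p-value
p.obs = .7
f0 = approx(null.dens$x, null.dens$y, xout=p.obs)$y   # under the null (about 1, since it's uniform)
f1 = approx(alt.dens$x,  alt.dens$y,  xout=p.obs)$y   # under the alternative

# local FDR with a 50/50 prior: what fraction of the mixture at p = .7 is null?
prior.null = .5
prior.null * f0 / (prior.null * f0 + (1 - prior.null) * f1)

In this simulation that ratio comes out well above one half, which matches the intuition that a p-value of .7 leans toward the null under an even prior.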

Now, I can already hear you asking "then why don't we use that instead of p-values?" Two reasons:

  1. You need to decide on a prior probability that the test is null
  2. You need to know the density under the alternative. This is very difficult to guess at, because you need to determine how large your effect sizes and variances can be, and how often effects of each size actually occur!

You do not need either of those for a p-value test, and a p-value test still lets you avoid false positives (which is its primary purpose). Now, it is possible to estimate both of those values in multiple hypothesis tests, when you have thousands of p-values (such as one test for each of thousands of genes: see this paper or this paper, for instance), but not when you're doing a single test.
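To give a flavour of how that estimation can work when you do have thousands of p-values, here is a rough Storey-style sketch of estimating the proportion of null hypotheses (an illustration of the general idea, not the exact method of the linked papers): since null p-values are uniform, the right-hand tail of the p-value histogram is dominated by nulls, and you can scale it up.

# a mixture of 1000 null tests and 1000 alternative tests, as in the simulations above
pvals = c(replicate(1000, t.test(rnorm(10, mean=0,  sd=1))$p.value),
          replicate(1000, t.test(rnorm(10, mean=.5, sd=1))$p.value))

# nulls are uniform, so about half of them land above .5; dividing by .5
# turns the observed fraction above .5 into an estimate of the null proportion
pi0.hat = mean(pvals > .5) / .5
pi0.hat   # overestimates the true 50% here, because these low-powered alternatives also leave large p-values

That upward bias is exactly point 2 above at work: how good the estimate is depends on how much of the alternative distribution leaks into the "flat" region.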

Finally, you might say "Isn't the paper still wrong to say a replication that leads to a p-value above .05 is necessarily a false positive?" Well, while it's true that getting one p-value of .04 and another p-value of .06 doesn't really mean the original result was wrong, in practice it's a reasonable metric to pick. But in any case, you might be glad to know others have their doubts about it! The paper you refer to is somewhat controversial in statistics: this paper uses a different method and comes to a very different conclusion about the p-values from medical research, and then that study was criticized by some prominent Bayesians (and round and round it goes...). So while your question is based on some faulty presumptions about p-values, I think it does examine an interesting assumption on the part of the paper you cite.