I understand that the $p$-value is the conditional probability of observing the test statistic or something more extreme given that the null hypothesis is true. I have read the great explanation by @user28 in this post: What is the meaning of p values and t values in statistical tests? However, do large $p$-values say anything? Does a larger $p$-value lend greater support to the null hypothesis? If I set the rejection region to be $p < 0.05$, does it make a difference whether I get a $p$-value of $0.06$ or $0.99$? (After all, $0.05$ is arbitrary, and $0.06$ is so close to the cutoff that if I had arbitrarily chosen $0.1$ instead of $0.05$, the null hypothesis would have been rejected.) Can one make any statistical use of a non-rejecting $p$-value?
Solved – the meaning of a large p-value
hypothesis testing, p-value
Related Solutions
The question looks simple, but your reflection around it shows that it is not that simple.
Actually, p-values are a relatively late addition to the theory of statistics. Computing a p-value without a computer is very tedious; this is why the only way to perform a statistical test until recently was to use tables of statistical tests, as I explain in this blog post. Because those tables were computed for fixed $\alpha$ levels (typically 0.05, 0.01 and 0.001) you could only perform a test with those levels.
Computers made those tables useless, but the logic of testing is still the same. You should:
- Formulate a null hypothesis.
- Formulate an alternative hypothesis.
- Decide the maximum type I error (the probability of falsely rejecting the null hypothesis) you are ready to accept.
- Design a rejection region. The probability that the test statistic falls in the rejection region, given that the null hypothesis is true, is your level $\alpha$. As @MånsT explains, this should be no larger than your acceptable type I error, and in many cases it is computed using asymptotic approximations.
- Carry out the random experiment, compute the test statistic and see whether it falls in the rejection region.
In theory, there is a strict equivalence between the events "the statistic falls in the rejection region" and "the p-value is less than $\alpha$", which is why you can report the p-value instead of the test decision. In practice, it allows you to skip step 3 and evaluate the type I error after the test is done.
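This equivalence can be checked numerically. Here is a small sketch using a two-sided one-sample t-test (the sample size, seed, and level are arbitrary choices for illustration):

```r
set.seed(1)
x <- rnorm(30)                # one sample drawn under the null (mean 0)
tt <- t.test(x)               # two-sided one-sample t-test of mu = 0
t_stat <- unname(tt$statistic)
crit <- qt(0.975, df = 29)    # critical value for alpha = 0.05
# "statistic in rejection region" and "p-value < alpha" always agree:
(abs(t_stat) > crit) == (tt$p.value < 0.05)
# [1] TRUE
```

Whatever the data, the two sides of the comparison agree, because the t-test's p-value is computed from the same ordering of the statistic that defines the rejection region.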
To come back to your post, the statement of the null hypothesis is incorrect. The null hypothesis is that the probability of flipping a head is $1/2$ (the null hypothesis cannot pertain to the results of the random experiment).
If you repeat the experiment again and again with a threshold p-value of 0.05, yes, you should have approximately 5% rejection. And if you set a p-value cut-off of 0.06, you should end up with roughly 6% rejection. More generally, for continuous tests, by definition of the p-value $p$
$$ \Pr(p < x) = x, \qquad 0 < x < 1, $$
which is only approximately true for discrete tests.
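For a continuous test, this uniformity is easy to see by simulation. A minimal sketch with one-sample t-tests on data generated under the null (the sample size and seed are illustrative choices):

```r
set.seed(123)
# Under the null, the p-values of a continuous test are uniform on (0, 1):
# 10,000 one-sample t-tests on data where the null (mean 0) is true.
p_cont <- replicate(10000, t.test(rnorm(20))$p.value)
mean(p_cont < 0.05)   # close to 0.05
mean(p_cont < 0.50)   # close to 0.50
```

The empirical proportions match the thresholds to within simulation error, unlike the discrete binomial test below.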
Here is some R code that I hope can clarify this a bit. The binomial test is relatively slow, so I do only 10,000 random experiments in which I flip 1000 coins. I perform a binomial test and collect the 10,000 p-values.
set.seed(123)
# Generate 10,000 random experiments, each flipping 1000 coins
rexperiments <- rbinom(n = 10000, size = 1000, prob = 0.5)
all_p_values <- rep(NA, 10000)
for (i in 1:10000) {
  all_p_values[i] <- binom.test(rexperiments[i], 1000)$p.value
}
# Plot the cumulative density of p-values.
plot(ecdf(all_p_values))
# How many are less than 0.05?
mean(all_p_values < 0.05)
# [1] 0.0425
# How many are less than 0.06?
mean(all_p_values < 0.06)
# [1] 0.0491
You can see that the proportions are not exact, because the sample size is not infinite and the test is discrete, but there is still an increase of roughly 1% between the two.
The significance level is the probability of getting a result in the rejection region, given the null hypothesis is true.
Note that the alternative puts an ordering on your test statistic - the values of the test statistic most in keeping with the alternative are the ones you want in your rejection region.
The p-value is the probability of a test statistic at least as extreme (under that ordering just mentioned) as the one from your sample, if the null hypothesis is true. If the test statistic is in the rejection region, the p-value is smaller than the significance level.
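As a sketch of that ordering, consider a one-sided z-test (the observed statistic and level here are hypothetical values for illustration):

```r
# One-sided z-test: H0 mu = 0 vs H1 mu > 0, known sd.
# The alternative orders the statistic: large z favours H1.
alpha  <- 0.05
z_crit <- qnorm(1 - alpha)   # rejection region is { z > z_crit }, about 1.645
z_obs  <- 1.8                # hypothetical observed statistic
p_val  <- 1 - pnorm(z_obs)   # prob of a z at least this extreme under H0
p_val < alpha                # TRUE exactly when z_obs > z_crit
```

Here `alpha` is the probability of the rejection region under the null (the significance level), while `p_val` is the probability of the one-sided tail beyond the observed statistic; the two coincide only when the statistic sits exactly on the boundary.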
That sounds a lot like what a p-value is?
Not really. The rejection region is the set of points where the null would be rejected. The p-value is, as stated above, a probability -- and not even the probability associated with that set of points (again, that's the significance level).
Maybe I am confused about what exactly a p-value is.
Maybe. Just check your definitions.
Best Answer
How you should 'use' the p-value depends on how you have designed your study with regard to the analyses you will run. I discuss two different philosophies about p-values in my answer here: When to use Fisher and Neyman-Pearson framework? You may find it helpful to read that. If you have, for example, run a power analysis and intend to use the p-value to make a final decision, you should not treat p-values close to the cutoff ('marginally significant') as a meaningful category. It is fine to use a different alpha than $0.05$ (such as $0.10$), but once you have decided on it and set your study up accordingly, you should stick with it.
In addition, you cannot use a large p-value as evidence for the null hypothesis. I discussed that idea in my answer here: Why do statisticians say a non-significant result means "you cannot reject the null" as opposed to accepting the null hypothesis? Reading that answer may be helpful to you as well.
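One way to see why a large p-value is not evidence for the null is to simulate an underpowered study. In this sketch the null is false by construction, yet most replications still return a p-value above $0.05$ (the effect size, sample size, and seed are arbitrary choices for illustration):

```r
set.seed(42)
# The true mean is 0.3, so the null (mean 0) is false,
# but n = 10 gives this one-sample t-test low power:
p_underpowered <- replicate(10000, t.test(rnorm(10, mean = 0.3))$p.value)
mean(p_underpowered > 0.05)   # most tests fail to reject despite a real effect
```

A large p-value here reflects the lack of power, not the truth of the null, which is why "failing to reject" is not the same as "accepting" the null hypothesis.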