Solved – Probability that Null Hypothesis is True

bayesianhypothesis testingprobability

So, this may be a common question, but I’ve never found a satisfactory answer.

How do you determine the probability that the null hypothesis is true (or false)?

Let’s say you give students two different versions of a test and want to see if the versions were equivalent. You perform a t-Test and it gives a p-value of .02. What a nice p-value! That must mean it’s unlikely that the tests are equivalent, right? No. Unfortunately, it appears that P(results|null) doesn’t tell you P(null|results). The normal thing to do is to reject the null hypothesis when we encounter a low p-value, but how do we know that we are not rejecting a null hypothesis that is very likely true? To give a silly example, I can design a test for ebola with a false positive rate of .02: put 50 balls in a bucket and write “ebola” on one. If I test someone with this and they pick the “ebola” ball, the p-value (P(picking the ball|they don’t have ebola)) is .02, but I definitely shouldn’t reject the null hypothesis that they are ebola-free.

Things I’ve considered so far:

Assuming P(null|results)~=P(results|null) – clearly false for some important applications.
Accept or reject hypothesis without knowing P(null|results) – Why are we accepting or rejecting them then? Isn’t the whole point that we reject what we think is LIKELY false and accept what is LIKELY true?
Use Bayes’ Theorem – But how do you get your priors? Don’t you end up back in the same place trying to determine them experimentally? And picking them a priori seems very arbitrary.
I found a very similar question here: stats.stackexchange.com/questions/231580/. The one answer here seems to basically say that it doesn't make sense to ask about the probability of a null hypothesis being true since that's a Bayesian question. Maybe I'm a Bayesian at heart, but I can't imagine not asking that question. In fact, it seems that the most common misunderstanding of p-values is that they are the probability of a true null hypothesis. If you really can't ask this question as a frequentist, then my main question is #3: how do you get your priors without getting stuck in a loop?

Edit:
Thank you for all the thoughtful replies. I want to address a couple common themes.

Definition of probability: I'm sure there is a lot of literature on this, but my naive conception is something like "the belief that a perfectly rational being would have given the information" or "the betting odds that would maximize profit if the situation was repeated and unknowns were allowed to vary".
Can we ever know P(H0|results)? Certainly, this seems to be a tough question. I believe though, that every probability is theoretically knowable, since probability is always conditional on the given information. Every event will either happen or not happen, so probability doesn't exist with full information. It only exists when there is insufficient information, so it should be knowable. For example, if I am told that someone has a coin and asked the probability of heads, I would say 50%. It may happen that the coin is weighted 70% to heads, but I wasn't given that information, so the probability WAS 50% for the info I had, just as even though it happens to land on tails, the probability WAS 70% heads when I learned that. Since probability is always conditional on a set of (insufficient) data, one can never not have enough data to determine it and so it should always be (theoretically) knowable.
Edit: "Always" may be a little too strong. There may be some philosophical questions for which we can't determine probability. Still, in real-world situations, while we can "almost never" have absolute certainty, there should "almost always" be a best estimate.

Best Answer

You have certainly identified an important problem and Bayesianism is one attempt at solving it. You can choose an uninformative prior if you wish. I will let others fill in more about the Bayes approach.

However, in the vast majority of circumstances, you know the null is false in the population, you just don't know how big the effect is. For example, if you make up a totally ludicrous hypothesis - e.g. that a person's weight is related to whether their SSN is odd or even - and you somehow manage to get accurate information from the entire population, the two means will not be exactly equal. They will (probably) differ by some insignificant amount, but they won't match exactly. ' If you go this route, you will deemphasize p values and significance tests and spend more time looking at the estimate of effect size and its accuracy. So, if you have a very large sample, you might find that people with odd SSN weigh 0.001 pounds more than people with even SSN, and that the standard error for this estimate is 0.000001 pounds, so p < 0.05 but no one should care.

Related Solutions

Hypothesis Testing – Steps After Rejecting the Null Hypothesis

You can generally continue to improve your estimate of whatever parameter you might be testing with more data. Stopping data collection once a test achieves some semi-arbitrary degree of significance is a good way to make bad inferences. That analysts may misunderstand a significant result as a sign that the job is done is one of many unintended consequences of the Neyman–Pearson framework, according to which people interpret p values as cause to either reject or fail to reject a null without reservation depending on which side of the critical threshold they fall on.

Without considering Bayesian alternatives to the frequentist paradigm (hopefully someone else will), confidence intervals continue to be more informative well beyond the point at which a basic null hypothesis can be rejected. Assuming collecting more data would just make your basic significance test achieve even greater significance (and not reveal that your earlier finding of significance was a false positive), you might find this useless because you'd reject the null either way. However, in this scenario, your confidence interval around the parameter in question would continue to shrink, improving the degree of confidence with which you can describe your population of interest precisely.

Here's a very simple example in r – testing the null hypothesis that $\mu=0$ for a simulated variable:

One Sample t-test

data:  rnorm(99) 
t = -2.057, df = 98, p-value = 0.04234
alternative hypothesis: true mean is not equal to 0 
95 percent confidence interval:
 -0.377762241 -0.006780574 
sample estimates:
 mean of x 
-0.1922714

Here I just used t.test(rnorm(99)), and I happened to get a false positive (assuming I've defaulted to $\alpha=.05$ as my choice of acceptable false positive error rate). If I ignore the confidence interval, I can claim my sample comes from a population with a mean that differs significantly from zero. Technically the confidence interval doesn't dispute this either, but it suggests that the mean could be very close to zero, or even further from it than I think based on this sample. Of course, I know the null is actually literally true here, because the mean of the rnorm population defaults to zero, but one rarely knows with real data.

Running this again as set.seed(8);t.test(rnorm(99,1)) produces a sample mean of .91, a p = 5.3E-13, and a 95% confidence interval for $\mu=[.69,1.12]$. This time I can be quite confident that the null is false, especially because I constructed it to be by setting the mean of my simulated data to 1.

Still, say it's important to know how different from zero it is; maybe a mean of .8 would be too close to zero for the difference to matter. I can see I don't have enough data to rule out the possibility that $\mu=.8$ from both my confidence interval and from a t-test with mu=.8, which gives a p = .33. My sample mean is high enough to seem meaningfully different from zero according to this .8 threshold though; collecting more data can help improve my confidence that the difference is at least this large, and not just trivially larger than zero.

Since I'm "collecting data" by simulation, I can be a little unrealistic and increase my sample size by an order of magnitude. Running set.seed(8);t.test(rnorm(999,1),mu=.8) reveals that more data continue to be useful after rejecting the null hypothesis of $\mu=0$ in this scenario, because I can now reject a null of $\mu=.8$ with my larger sample. The confidence interval of $\mu=[.90,1.02]$ even suggests I could've rejected null hypotheses up to $\mu=.89$ if I'd set out to do so initially.

I can't revise my null hypothesis after the fact, but without collecting new data to test an even stronger hypothesis after this result, I can say with 95% confidence that replicating my "study" would allow me to reject a $H_0:\mu=.9$. Again, just because I can simulate this easily, I'll rerun the code as set.seed(9);t.test(rnorm(999,1),mu=.9): doing so demonstrates my confidence wasn't misplaced.

Testing progressively more stringent null hypotheses, or better yet, simply focusing on shrinking your confidence intervals is just one way to proceed. Of course, most studies that reject null hypotheses lay the groundwork for other studies that build on the alternative hypothesis. E.g., if I was testing an alternative hypothesis that a correlation is greater than zero, I could test for mediators or moderators in a follow-up study next...and while I'm at it, I'd definitely want to make sure I could replicate the original result.

Another approach to consider is equivalence testing. If you want to conclude that a parameter is within a certain range of possible values, not just different from a single value, you can specify that range of values you'd want the parameter to lie within according to your conventional alternative hypothesis and test it against a different set of null hypotheses that together represent the possibility that the parameter lies outside that range. This last possibility might be most similar to what you had in mind when you wrote:

We have "some evidence" for the alternative to be true, but we can't draw that conclusion. If I really want to draw that conclusion conclusively...

Here's an example using similar data as above (using set.seed(8), rnorm(99) is the same as rnorm(99,1)-1, so the sample mean is -.09). Say I want to test the null hypothesis of two one-sided t-tests that jointly posit that the sample mean is not between -.2 and .2. This corresponds loosely to the previous example's premise, according to which I wanted to test if $\mu=.8$. The difference is that I've shifted my data down by 1, and I'm now going to perform two one-sided tests of the alternative hypothesis that $-.2\le\mu\le.2$. Here's how that looks:

require(equivalence);set.seed(8);tost(rnorm(99),epsilon=.2)

tost sets the confidence level of the interval to 90%, so the confidence interval around the sample mean of -.09 is $\mu=[-.27,.09]$, and p = .17. However, running this again with rnorm(999) (and the same seed) shrinks the 90% confidence interval to $\mu=[-.09,.01]$, which is within the equivalence range specified in the null hypothesis with p = 4.55E-07.

I still think the confidence interval is more interesting than the equivalence test result. It represents what the data suggest the population mean is more specifically than the alternative hypothesis, and suggests I can be reasonably confident that it lies within an even smaller interval than I've specified in the alternative hypothesis. To demonstrate, I'll abuse my unrealistic powers of simulation once more and "replicate" using set.seed(7);tost(rnorm(999),epsilon=.09345092): sure enough, p = .002.

Solved – Statistical significance level – hypothesis testing

Your understanding is mostly correct. Let $X$ be a random variable that follows the same distribution as your test statistic under the null hypothesis. The p value is the probability that a randomly drawn $X$ is at least as large as the test statistic you computed. If that probability is very low, then that is good reason to believe that the null hypothesis does not hold.

You just need to be careful about the difference in terminology between p value and significance level. A significance level is a pre-specified cutoff p value, below which you reject the null hypothesis and above which you do not have enough evidence to reject the null hypothesis. The p value itself is just a probability-valued function of the test statistic that gets smaller as the test statistic gets more extreme (i.e. the CDF of the distribution of the test statistic under the null).

So the significance level does not determine the probability of rejecting the null hypothesis. The significance level determines the largest probability of rejecting the null that you would consider evidence enough to reject the null. When you set a significance level, you are setting an upper bound, below which you find the probability of observing the null too extreme to believe it was randomly drawn from the null distribution.

You might have been confused by someone talking about type 1 error rates and such. All that stuff means is that, if you run the experiment many times, if the null hypothesis is true ever time, and you set your significance level to $\alpha$, you will reject the null hypothesis $\alpha \times 100$% of the time purely due to random chance. Understanding this can help you set reasonable $\alpha$ levels if you do plan to do null hypothesis testing.

Best Answer

Related Solutions

Hypothesis Testing – Steps After Rejecting the Null Hypothesis

Solved – Statistical significance level – hypothesis testing

Related Question