Solved – Is it possible to accept the alternative hypothesis?

hypothesis testing

I'm aware of several related questions here (e.g., Hypothesis testing terminology surrounding null, Is it possible to prove a null hypothesis?), but I don't know the definitive answer to my question below.

Suppose we run a hypothesis test of whether a coin is fair. We have two hypotheses:

$H_0: p(\mathrm{head})=0.5$

$H_1: p(\mathrm{head})\neq0.5$

Suppose we use a 5% significance level; then there are two possible cases:

  1. When we obtain the data and find that the p-value is less than 0.05, we say "At the 5% significance level, we reject $H_0$."
  2. When the p-value is greater than 0.05, we say "At the 5% significance level, we cannot reject $H_0$."
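
For concreteness, here is a minimal sketch of how such a p-value could be computed (this assumes SciPy ≥ 1.7 and its `scipy.stats.binomtest`; the flip counts are purely illustrative):

```python
# Two-sided exact binomial test of H0: p(head) = 0.5 -- illustrative sketch only.
from scipy.stats import binomtest

n_flips = 100   # total flips (hypothetical data)
n_heads = 61    # observed heads (hypothetical data)

result = binomtest(n_heads, n_flips, p=0.5, alternative='two-sided')
print(f"p-value = {result.pvalue:.4f}")

alpha = 0.05
if result.pvalue < alpha:
    print("At the 5% significance level, we reject H0.")
else:
    print("At the 5% significance level, we cannot reject H0.")
```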

My question is:

In case 1, is it correct to say "we accept $H_1$"?

Intuitively, and from what I have learned in the past, I feel that "accepting" anything as a result of hypothesis testing is always incorrect. On the other hand, in this case, since the union of $H_0$ and $H_1$ covers the whole "space", "rejecting $H_0$" and "accepting $H_1$" look exactly the same to me. Then again, I can also think of the following argument for why it is incorrect to say "we accept $H_1$":

We have evidence strong enough to believe that $H_0$ is not true, but we may not have evidence strong enough to believe that $H_1$ is true. Therefore, "rejecting $H_0$" does not automatically imply "accepting $H_1$".

So, what is the right answer?

Best Answer

IMO (speaking as neither a logician nor a formally trained statistician), one shouldn't take any of this language too seriously. Even rejecting a null when p < .001 doesn't make the null false beyond all doubt. What's the harm, then, in "accepting" the alternative hypothesis in a similarly provisional sense? It strikes me as a safer interpretation than "accepting the null" in the opposite scenario (i.e., a large, nonsignificant p), because the alternative hypothesis is so much less specific. E.g., given $\alpha=.05$, if p = .06, there's still a 94% chance that future studies would find an effect that's at least as different from the null*, so accepting the null isn't a smart bet even if one cannot reject it. Conversely, if p = .04, one can reject the null, which I've always understood to imply favoring the alternative. Why not "accept" it? The only reason I can see is the fact that one could be wrong, but the same applies when rejecting.

The alternative isn't a particularly strong claim, because as you say, it covers the whole "space". To reject your null, one must find an effect reliably on one side of the null or the other, such that the confidence interval doesn't include the null value. Given such a confidence interval (CI), the alternative hypothesis is true of it: every value within it is unequal to the null. The alternative hypothesis is also true of values outside the CI that are even more different from the null than the most extreme value inside it (e.g., if $\rm CI_{95\%}=[.6,.8]$, it wouldn't even be a problem for the alternative hypothesis if $p(\mathrm{head})=.9$). If you can get a CI like that, then again, what's not to accept about it, let alone the alternative hypothesis?
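
To make that concrete, here is a small sketch (hypothetical counts; it assumes SciPy ≥ 1.7, whose `BinomTestResult.proportion_ci` gives a Clopper–Pearson interval by default) that checks whether such a CI excludes the null value of .5:

```python
# Sketch: does the 95% CI for p(head) exclude the null value 0.5?
from scipy.stats import binomtest

n_flips, n_heads = 200, 140   # made-up data with an observed proportion of 0.7
result = binomtest(n_heads, n_flips, p=0.5, alternative='two-sided')
ci = result.proportion_ci(confidence_level=0.95)   # Clopper-Pearson by default

print(f"estimate = {n_heads / n_flips:.2f}, 95% CI = [{ci.low:.3f}, {ci.high:.3f}]")
if ci.low > 0.5 or ci.high < 0.5:
    print("The CI excludes .5: every value inside it is consistent with H1, not H0.")
```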

There might be some argument of which I'm unaware, but I doubt I'd be persuaded. Pragmatically, it might be wise not to write that you're accepting the alternative if there are reviewers involved, because success with them (as with people in general) often depends on not defying expectations in unwelcome ways. There's not much at stake anyway if you're not taking "accept" or "reject" too strictly as the final truth of the matter. I think that's the more important mistake to avoid in any case.

It's also important to remember that the null can be useful even if it's probably untrue. In the first example I mentioned, where p = .06, failing to reject the null isn't the same as betting that it's true, but it's basically the same as judging it scientifically useful. Rejecting it is basically the same as judging the alternative to be more useful. That seems close enough to "acceptance" to me, especially since the alternative isn't much of a hypothesis to accept.

BTW, this is another argument for focusing on CIs: if you can reject the null using Neyman–Pearson-style reasoning, then it doesn't matter how much smaller than $\alpha$ the p-value is for the purpose of rejecting the null. It may matter by Fisher's reasoning, but if you can reject the null at a level of $\alpha$ that works for you, then it might be more useful to carry that $\alpha$ forward into a CI instead of just rejecting the null more confidently than you need to (a sort of statistical "overkill"). If you have a comfortable error rate $\alpha$ in advance, try using that error rate to describe what you think the effect size could be within a $\rm CI_{(1-\alpha)}$. This is probably more useful than accepting a rather vague alternative hypothesis for most purposes.
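
As a rough illustration of carrying the same $\alpha$ forward (using the large-sample Wald approximation, just as a sketch rather than the interval one would necessarily report):

$$\mathrm{CI}_{(1-\alpha)} = \hat p \pm z_{1-\alpha/2}\sqrt{\frac{\hat p\,(1-\hat p)}{n}}$$

With hypothetical values $\hat p = .61$, $n = 100$, and $\alpha = .05$ (so $z_{.975} \approx 1.96$), this gives $.61 \pm 1.96\sqrt{.61 \times .39 / 100} \approx [.51, .71]$, which excludes $.5$ and therefore rejects the null at exactly the same $\alpha$, while also saying something about the plausible size of the coin's bias.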


* Another important point about interpreting this example p-value is that it represents this chance only under the scenario in which the null is assumed to be true. If the null is untrue, as the evidence here would seem to suggest (albeit not persuasively enough for conventional scientific standards), then that chance is even greater. In other words, even if the null is true (but one doesn't know this), it wouldn't be wise to bet on it in this case, and the bet is even worse if the null is untrue!