You can generally continue to improve your estimate of whatever parameter you might be testing with more data. Stopping data collection once a test achieves some semi-arbitrary degree of significance is a good way to make bad inferences. That analysts may misunderstand a significant result as a sign that the job is done is one of many unintended consequences of the Neyman–Pearson framework, according to which people interpret p values as cause to either reject or fail to reject a null without reservation depending on which side of the critical threshold they fall on.
Without considering Bayesian alternatives to the frequentist paradigm (hopefully someone else will), confidence intervals continue to be more informative well beyond the point at which a basic null hypothesis can be rejected. Assuming collecting more data would just make your basic significance test achieve even greater significance (and not reveal that your earlier finding of significance was a false positive), you might find this useless because you'd reject the null either way. However, in this scenario, your confidence interval around the parameter in question would continue to shrink, improving the degree of confidence with which you can describe your population of interest precisely.
Here's a very simple example in R, testing the null hypothesis that $\mu=0$ for a simulated variable:
One Sample t-test
data: rnorm(99)
t = -2.057, df = 98, p-value = 0.04234
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.377762241 -0.006780574
sample estimates:
mean of x
-0.1922714
Here I just used `t.test(rnorm(99))`, and I happened to get a false positive (assuming I've defaulted to $\alpha=.05$ as my choice of acceptable false positive error rate). If I ignore the confidence interval, I can claim my sample comes from a population with a mean that differs significantly from zero. Technically the confidence interval doesn't dispute this either, but it suggests that the mean could be very close to zero, or even further from it than I think based on this sample. Of course, I know the null is actually literally true here, because the mean of the `rnorm` population defaults to zero, but one rarely knows with real data.
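To see how often a false positive like this turns up, here is a minimal simulation sketch. The seed and the number of simulated "studies" (`n_sims`) are my own arbitrary choices, not part of the example above; the only thing taken from it is the call `t.test(rnorm(99))`.

```r
# Repeat the one-sample t-test many times when the null (mu = 0) is literally true.
set.seed(1)        # arbitrary seed, only for reproducibility
n_sims <- 10000    # number of simulated "studies" (an assumption)
alpha  <- 0.05     # acceptable false positive rate

# Each replicate draws a fresh sample of 99 standard normal values
p_values <- replicate(n_sims, t.test(rnorm(99))$p.value)

# Proportion of simulated studies that (wrongly) reject the true null
false_positive_rate <- mean(p_values < alpha)
false_positive_rate
```

The proportion should land close to $\alpha$, which is all a false positive rate of .05 promises: about one such sample in twenty will look "significant" even though the population mean really is zero.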
Running this again as `set.seed(8);t.test(rnorm(99,1))` produces a sample mean of .91, a p = 5.3E-13, and a 95% confidence interval for $\mu=[.69,1.12]$. This time I can be quite confident that the null is false, especially because I constructed it to be by setting the mean of my simulated data to 1.
Still, say it's important to know how different from zero it is; maybe a mean of .8 would be too close to zero for the difference to matter. I can see I don't have enough data to rule out the possibility that $\mu=.8$ from both my confidence interval and from a t-test with `mu=.8`, which gives a p = .33. My sample mean is high enough to seem meaningfully different from zero according to this .8 threshold though; collecting more data can help improve my confidence that the difference is at least this large, and not just trivially larger than zero.
Since I'm "collecting data" by simulation, I can be a little unrealistic and increase my sample size by an order of magnitude. Running `set.seed(8);t.test(rnorm(999,1),mu=.8)` reveals that more data continue to be useful after rejecting the null hypothesis of $\mu=0$ in this scenario, because I can now reject a null of $\mu=.8$ with my larger sample. The confidence interval of $\mu=[.90,1.02]$ even suggests I could've rejected null hypotheses up to $\mu=.89$ if I'd set out to do so initially.
I can't revise my null hypothesis after the fact, but without collecting new data to test an even stronger hypothesis after this result, I can say with 95% confidence that replicating my "study" would allow me to reject an $H_0:\mu=.9$. Again, just because I can simulate this easily, I'll rerun the code as `set.seed(9);t.test(rnorm(999,1),mu=.9)`: doing so demonstrates my confidence wasn't misplaced.
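The shrinking of the confidence interval with sample size can be sketched directly. The sample sizes below (including the extra 9999) are my own choices for illustration, and only the first draw after `set.seed(8)` matches the 99-observation sample used above; the larger samples are fresh draws.

```r
# Width of the 95% confidence interval for the mean at increasing sample sizes.
set.seed(8)
sample_sizes <- c(99, 999, 9999)   # illustrative sizes (an assumption)

ci_widths <- sapply(sample_sizes, function(n) {
  ci <- t.test(rnorm(n, mean = 1))$conf.int   # default conf.level = 0.95
  ci[2] - ci[1]                               # upper bound minus lower bound
})

data.frame(n = sample_sizes, ci_width = ci_widths)
```

Because the standard error scales with $1/\sqrt{n}$, each tenfold increase in $n$ should cut the width by roughly a factor of three, long after the null of $\mu=0$ has been rejected.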
Testing progressively more stringent null hypotheses, or better yet, simply focusing on shrinking your confidence intervals, is just one way to proceed. Of course, most studies that reject null hypotheses lay the groundwork for other studies that build on the alternative hypothesis. E.g., if I were testing an alternative hypothesis that a correlation is greater than zero, I could test for mediators or moderators in a follow-up study next... and while I'm at it, I'd definitely want to make sure I could replicate the original result.
Another approach to consider is equivalence testing. If you want to conclude that a parameter is within a certain range of possible values, not just different from a single value, you can specify that range of values you'd want the parameter to lie within according to your conventional alternative hypothesis and test it against a different set of null hypotheses that together represent the possibility that the parameter lies outside that range. This last possibility might be most similar to what you had in mind when you wrote:
> We have "some evidence" for the alternative to be true, but we can't draw that conclusion. If I really want to draw that conclusion conclusively...
Here's an example using data similar to the above (using `set.seed(8)`, `rnorm(99)` is the same as `rnorm(99,1)-1`, so the sample mean is -.09). Say I want to test the null hypothesis of two one-sided t-tests that jointly posit that the population mean is not between -.2 and .2. This corresponds loosely to the previous example's premise, according to which I wanted to test if $\mu=.8$. The difference is that I've shifted my data down by 1, and I'm now going to perform two one-sided tests of the alternative hypothesis that $-.2\le\mu\le.2$. Here's how that looks:
require(equivalence);set.seed(8);tost(rnorm(99),epsilon=.2)
`tost` sets the confidence level of the interval to 90%, so the confidence interval around the sample mean of -.09 is $\mu=[-.27,.09]$, and p = .17. However, running this again with `rnorm(999)` (and the same seed) shrinks the 90% confidence interval to $\mu=[-.09,.01]$, which is within the equivalence range specified in the alternative hypothesis, with p = 4.55E-07.
I still think the confidence interval is more interesting than the equivalence test result. It represents what the data suggest the population mean is more specifically than the alternative hypothesis, and suggests I can be reasonably confident that it lies within an even smaller interval than I've specified in the alternative hypothesis. To demonstrate, I'll abuse my unrealistic powers of simulation once more and "replicate" using `set.seed(7);tost(rnorm(999),epsilon=.09345092)`: sure enough, p = .002.
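For anyone without the `equivalence` package, the logic of two one-sided tests can be sketched in base R. `tost_by_hand` is a hypothetical helper of my own, not a function from any package, and I haven't checked that it reproduces `tost`'s output digit for digit; it only illustrates the reasoning.

```r
# Hypothetical helper (not from any package): one-sample TOST built from t.test().
tost_by_hand <- function(x, epsilon, alpha = 0.05) {
  # First null: mu <= -epsilon, tested against mu > -epsilon
  p_lower <- t.test(x, mu = -epsilon, alternative = "greater")$p.value
  # Second null: mu >= epsilon, tested against mu < epsilon
  p_upper <- t.test(x, mu = epsilon, alternative = "less")$p.value
  # The (1 - 2 * alpha) interval that corresponds to the pair of one-sided tests
  ci <- t.test(x, conf.level = 1 - 2 * alpha)$conf.int
  # Both nulls must be rejected, so the larger p value is the one that counts
  list(p_value = max(p_lower, p_upper), conf_int = as.numeric(ci))
}

set.seed(8)
small <- tost_by_hand(rnorm(99), epsilon = 0.2)    # same draw as the n = 99 example
set.seed(8)
large <- tost_by_hand(rnorm(999), epsilon = 0.2)   # same draw as the n = 999 example

small
large
```

The design point is that equivalence is declared at level $\alpha$ exactly when the $1-2\alpha$ (here 90%) confidence interval falls entirely inside $(-\epsilon,\epsilon)$, which is why the interval carries at least as much information as the test result.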
IMO (as not-a-logician or formally trained statistician per se), one shouldn't take any of this language too seriously. Even rejecting a null when p < .001 doesn't make the null false without a doubt. What's the harm in "accepting" the alternative hypothesis in a similarly provisional sense then? It strikes me as a safer interpretation than "accepting the null" in the opposite scenario (i.e., a large, insignificant p), because the alternative hypothesis is so much less specific. E.g., given $\alpha=.05$, if p = .06, there's still a 94% chance that future studies would find an effect that's at least as different from the null*, so accepting the null isn't a smart bet even if one cannot reject the null. Conversely, if p = .04, one can reject the null, which I've always understood to imply favoring the alternative. Why not "accepting"? The only reason I can see is the fact that one could be wrong, but the same applies when rejecting.
The alternative isn't a particularly strong claim, because as you say, it covers the whole "space". To reject your null, one must find a reliable effect on either side of the null such that the confidence interval doesn't include the null. Given such a confidence interval (CI), the alternative hypothesis is true of it: all values within are unequal to the null. The alternative hypothesis is also true of values outside the CI but more different from the null than the most extremely different value within the CI (e.g., if $\rm CI_{95\%}=[.6,.8]$, it wouldn't even be a problem for the alternative hypothesis if $\mathbb P(\rm head)=.9$). If you can get a CI like that, then again, what's not to accept about it, let alone the alternative hypothesis?
There might be some argument of which I'm unaware, but I doubt I'd be persuaded. Pragmatically, it might be wise not to write that you're accepting the alternative if there are reviewers involved, because success with them (as with people in general) often depends on not defying expectations in unwelcome ways. There's not much at stake anyway if you're not taking "accept" or "reject" too strictly as the final truth of the matter. I think that's the more important mistake to avoid in any case.
It's also important to remember that the null can be useful even if it's probably untrue. In the first example I mentioned where p = .06, failing to reject the null isn't the same as betting that it's true, but it's basically the same as judging it scientifically useful. Rejecting it is basically the same as judging the alternative to be more useful. That seems close enough to "acceptance" to me, especially since it isn't much of a hypothesis to accept.
BTW, this is another argument for focusing on CIs: if you can reject the null using Neyman–Pearson-style reasoning, then it doesn't matter how much smaller than $\alpha$ the p is for the sake of rejecting the null. It may matter by Fisher's reasoning, but if you can reject the null at a level of $\alpha$ that works for you, then it might be more useful to carry that $\alpha$ forward in a CI instead of just rejecting the null more confidently than you need to (a sort of statistical "overkill"). If you have a comfortable error rate $\alpha$ in advance, try using that error rate to describe what you think the effect size could be within a $\rm CI_{(1-\alpha)}$. This is probably more useful than accepting a more vague alternative hypothesis for most purposes.
* Another important point about the interpretation of this example p value is that it represents this chance for the scenario in which it is given that the null is true. If the null is untrue as evidence would seem to suggest in this case (albeit not persuasively enough for conventional scientific standards), then that chance is even greater. In other words, even if the null is true (but one doesn't know this), it wouldn't be wise to bet so in this case, and the bet is even worse if it's untrue!
Best Answer
For calculating the probability of a Type I Error, we start with: $$ \begin{equation} \label{eql} \begin{split} \text{Pr}(\text{Type I Error}) & = \text{Pr}(\text{reject }H_0 | H_0 \text{ is true}) \\ & = \text{Pr}(\text{reject }H_0 | p=.5, n=5) \end{split} \end{equation} $$
The probability mass function $\text{Pr}(X=x)=\binom{5}{x}.5^x .5^{5-x}$ (note that your pmf incorrectly uses $1-p=.95$) for a binomial random variable $X$ given our $H_0$ ($p=.5,n=5$) is: $$ \begin{split} \text{Pr}(X=0) = \frac{1}{32} = .03125 \\ \text{Pr}(X=1) = \frac{5}{32} = .15625 \\ \text{Pr}(X=2) = \frac{5}{16} = .31250 \\ \text{Pr}(X=3) = \frac{5}{16} = .31250 \\ \text{Pr}(X=4) = \frac{5}{32} = .15625 \\ \text{Pr}(X=5) = \frac{1}{32} = .03125 \end{split} $$
Noting above that only $\text{Pr}(X=0)$ and $\text{Pr}(X=5)$ are below our $\alpha=.05$ threshold, and therefore that $H_0$ may only be rejected if a sample results in $X=0$ or $X=5$, we can move forward as follows:
$$ \begin{equation} \label{eql1} \begin{split} \text{Pr}(\text{Type I Error}) & = \text{Pr}(\text{reject }H_0 | p=.5,n=5) \\ & = \text{Pr}(X=0| p=.5,n=5) + \text{Pr}(X=5| p=.5,n=5) \\ & =2\cdot.03125=.0625=\frac{1}{16} \end{split} \end{equation} $$
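The arithmetic above can be checked with a few lines of R. This is just a sketch that follows the same rejection rule used in the derivation (reject only for outcomes whose individual probability under $H_0$ is below $\alpha$); the variable names are my own.

```r
# Binomial pmf under H0: p = .5, n = 5
n     <- 5
p0    <- 0.5
alpha <- 0.05

pmf <- dbinom(0:n, size = n, prob = p0)
names(pmf) <- 0:n
pmf

# Outcomes whose individual probability under H0 falls below alpha
rejection_region <- (0:n)[pmf < alpha]   # X = 0 and X = 5

# Probability of landing in the rejection region when H0 is true
type1_error <- sum(pmf[pmf < alpha])
type1_error                              # 0.0625, i.e. 1/16
```

Note that the resulting Type I error rate of .0625 exceeds the nominal $\alpha=.05$, which is a consequence of how coarse the distribution is with only five trials.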