You can generally continue to improve your estimate of whatever parameter you're testing by collecting more data. Stopping data collection as soon as a test achieves some semi-arbitrary degree of significance is a good way to make bad inferences. Analysts' tendency to misread a significant result as a sign that the job is done is one of many unintended consequences of the Neyman–Pearson framework, under which a p value is taken as grounds to either reject or fail to reject a null hypothesis without reservation, depending on which side of the critical threshold it falls on.
Setting aside Bayesian alternatives to the frequentist paradigm (hopefully someone else will cover them), confidence intervals remain informative well beyond the point at which a basic null hypothesis can be rejected. If collecting more data would just make your significance test achieve even greater significance (rather than reveal that your earlier finding of significance was a false positive), you might find this useless because you'd reject the null either way. However, in this scenario, your confidence interval around the parameter in question would continue to shrink, letting you describe your population of interest with increasing precision.
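To make that concrete, here's a minimal sketch in R (the seed and the true mean of 1 are arbitrary choices for illustration): at every sample size shown, a basic test against $\mu=0$ would reject overwhelmingly, yet the interval keeps narrowing.

```r
# The 95% CI around a sample mean keeps narrowing as n grows,
# long after p has dropped below any conventional threshold.
set.seed(1)  # arbitrary seed, for reproducibility only
for (n in c(100, 1000, 10000)) {
  ci <- t.test(rnorm(n, mean = 1))$conf.int
  cat(sprintf("n = %5d: 95%% CI [%.3f, %.3f], width %.3f\n",
              n, ci[1], ci[2], ci[2] - ci[1]))
}
```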
Here's a very simple example in R – testing the null hypothesis that $\mu=0$ for a simulated variable:
```
        One Sample t-test

data:  rnorm(99)
t = -2.057, df = 98, p-value = 0.04234
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.377762241 -0.006780574
sample estimates:
 mean of x
-0.1922714
```
Here I just used `t.test(rnorm(99))`, and I happened to get a false positive (assuming I've defaulted to $\alpha=.05$ as my choice of acceptable false positive error rate). If I ignore the confidence interval, I can claim my sample comes from a population with a mean that differs significantly from zero. Technically the confidence interval doesn't dispute this either, but it suggests that the mean could be very close to zero, or even further from it than I think based on this sample. Of course, I know the null is actually literally true here, because the mean of the `rnorm` population defaults to zero, but one rarely knows with real data.
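That a false positive turned up here isn't bad luck worth dwelling on; it's easy to check by simulation that it happens at about the advertised rate. A quick sketch (the seed and number of replications are arbitrary):

```r
# With the null literally true (mu = 0), t.test(rnorm(99)) should
# reject at alpha = .05 in roughly 5% of runs.
set.seed(1)  # arbitrary seed
mean(replicate(10000, t.test(rnorm(99))$p.value < .05))
```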
Running this again as `set.seed(8); t.test(rnorm(99, 1))` produces a sample mean of .91, p = 5.3E-13, and a 95% confidence interval for $\mu$ of $[.69, 1.12]$. This time I can be quite confident that the null is false, especially because I constructed it to be false by setting the mean of my simulated data to 1.
Still, say it's important to know how different from zero the mean is; maybe a mean of .8 would be too close to zero for the difference to matter. I can see I don't have enough data to rule out the possibility that $\mu=.8$, both from my confidence interval and from a t-test with `mu = .8`, which gives p = .33. My sample mean is high enough to seem meaningfully different from zero by this .8 threshold though; collecting more data can help improve my confidence that the difference is at least that large, and not just trivially larger than zero.
Since I'm "collecting data" by simulation, I can be a little unrealistic and increase my sample size by an order of magnitude. Running `set.seed(8); t.test(rnorm(999, 1), mu = .8)` reveals that more data continue to be useful after rejecting the null hypothesis of $\mu=0$ in this scenario, because I can now reject a null of $\mu=.8$ with my larger sample. The confidence interval of $[.90, 1.02]$ even suggests I could've rejected null hypotheses up to $\mu=.89$ if I'd set out to do so initially.
I can't revise my null hypothesis after the fact, but without collecting new data to test an even stronger hypothesis after this result, I can say with 95% confidence that replicating my "study" would allow me to reject $H_0\!:\mu=.9$. Again, just because I can simulate this easily, I'll rerun the code as `set.seed(9); t.test(rnorm(999, 1), mu = .9)`: doing so demonstrates my confidence wasn't misplaced.
Testing progressively more stringent null hypotheses, or better yet, simply focusing on shrinking your confidence intervals, is just one way to proceed (sketched below). Of course, most studies that reject null hypotheses lay the groundwork for other studies that build on the alternative hypothesis. E.g., if I were testing an alternative hypothesis that a correlation is greater than zero, I could test for mediators or moderators in a follow-up study next...and while I'm at it, I'd definitely want to make sure I could replicate the original result.
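Here's a rough sketch of that first idea, reusing the simulated sample from above; the grid of candidate nulls is my own choice for illustration:

```r
# Which null values of mu does the larger (n = 999) sample reject
# at alpha = .05? Every value outside the 95% CI should be rejected.
set.seed(8)
x <- rnorm(999, 1)
mus <- seq(0, 1, by = .1)  # illustrative grid of null hypotheses
setNames(sapply(mus, function(m) t.test(x, mu = m)$p.value < .05), mus)
```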
Another approach to consider is equivalence testing. If you want to conclude that a parameter lies within a certain range of possible values, not merely that it differs from a single value, you can specify that range as your alternative hypothesis and test it against a set of null hypotheses that together represent the possibility that the parameter lies outside it. This last possibility might be most similar to what you had in mind when you wrote:
> We have "some evidence" for the alternative to be true, but we can't draw that conclusion. If I really want to draw that conclusion conclusively...
Here's an example using similar data as above (using `set.seed(8)`, `rnorm(99)` is the same as `rnorm(99, 1) - 1`, so the sample mean is -.09). Say I want to test, with two one-sided t-tests, a null hypothesis that jointly posits the sample mean is not between -.2 and .2. This corresponds loosely to the previous example's premise, according to which I wanted to test whether $\mu=.8$. The difference is that I've shifted my data down by 1, and I'm now going to perform two one-sided tests of the alternative hypothesis that $-.2\le\mu\le.2$. Here's how that looks:

```r
require(equivalence); set.seed(8); tost(rnorm(99), epsilon = .2)
```
`tost` sets the confidence level of the interval to 90%, so the confidence interval around the sample mean of -.09 is $[-.27, .09]$, and p = .17. However, running this again with `rnorm(999)` (and the same seed) shrinks the 90% confidence interval to $[-.09, .01]$, which lies entirely within the specified equivalence range, and the non-equivalence null is rejected with p = 4.55E-07.
I still think the confidence interval is more interesting than the equivalence test result. It represents what the data suggest the population mean is more specifically than the alternative hypothesis does, and suggests I can be reasonably confident that it lies within an even smaller interval than I've specified in the alternative hypothesis. To demonstrate, I'll abuse my unrealistic powers of simulation once more and "replicate" using `set.seed(7); tost(rnorm(999), epsilon = .09345092)`: sure enough, p = .002.
Answer to question 1: This occurs because, in frequentist tests for difference (i.e., tests with a null hypothesis of no difference or some other form of equality), the $p$-value becomes arbitrarily small as the sample size increases whenever a true difference exactly equal to zero, as opposed to arbitrarily close to zero, is unrealistic (see Nick Stauner's comment to the OP). This happens because the standard error of frequentist test statistics generally decreases with sample size, with the upshot that all differences are significant to an arbitrary level given a large enough sample. Cosma Shalizi has written eruditely about this.
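A quick sketch of this point in R, assuming (arbitrarily) a true mean of .02, i.e., trivially different from zero: the $p$-value against $H_0\!:\mu=0$ still collapses as the sample grows.

```r
# A trivially small true effect still becomes "significant"
# once the sample is large enough.
set.seed(1)  # arbitrary seed
for (n in c(100, 10000, 1000000)) {
  cat(sprintf("n = %7d: p = %.3g\n", n,
              t.test(rnorm(n, mean = .02))$p.value))
}
```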
Answer to question 2: Within a frequentist hypothesis testing framework, one can guard against this by not making inferences solely about detecting a difference. For example, one can combine inferences about difference and equivalence so that the burden of proof does not fall lopsidedly on evidence of effect versus evidence of absence of effect (and so the two are not conflated!). Evidence of absence of an effect comes from, for example:
- two one-sided tests for equivalence (TOST),
- uniformly most powerful tests for equivalence, and
- the confidence interval approach to equivalence (i.e., if the $100(1-2\alpha)\%$ CI of the test statistic lies within the a priori-defined range of equivalence/relevance, then one concludes equivalence at the $\alpha$ level of significance).
What these approaches all share is an a priori decision about what effect size constitutes a relevant difference and a null hypothesis framed in terms of a difference at least as large as what is considered relevant.
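As a minimal sketch of the last (confidence interval) approach, reusing the simulated data and borrowing the ±.2 equivalence range used earlier in this thread:

```r
# CI approach to equivalence: conclude equivalence at the alpha = .05
# level if the 1 - 2*alpha = 90% CI lies entirely within [-.2, .2].
set.seed(8)
ci <- t.test(rnorm(999), conf.level = .90)$conf.int
c(lower = ci[1], upper = ci[2], equivalent = ci[1] > -.2 & ci[2] < .2)
```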
Combined inference from tests for difference and tests for equivalence thus protects against the bias you describe when sample sizes are large. The two-by-two table below shows the four possibilities resulting from combining a test for difference (positivist null hypothesis, $\text{H}_{0}^{+}$) with a test for equivalence (negativist null hypothesis, $\text{H}_{0}^{-}$):

|                                       | Reject $\text{H}_{0}^{+}$ | Fail to reject $\text{H}_{0}^{+}$ |
|---------------------------------------|---------------------------|-----------------------------------|
| **Reject $\text{H}_{0}^{-}$**         | Trivial difference        | Equivalence                       |
| **Fail to reject $\text{H}_{0}^{-}$** | Relevant difference       | Indeterminate (underpowered)      |
Notice the upper left quadrant: an overpowered test is one where you reject the null hypothesis of no difference, but you also reject the null hypothesis of a relevant difference; so yes, there's a difference, but you have a priori decided you do not care about it because it is too small.
Answer to question 3: See answer to 2.
Best Answer
Failing to reject a null hypothesis is evidence that the null hypothesis is true, but it might not be particularly good evidence, and it certainly doesn't prove the null hypothesis.
Let's take a short detour. Consider for a moment the old cliché:

> Absence of evidence is not evidence of absence.
Notwithstanding its popularity, this statement is nonsense. If you look for something and fail to find it, that is absolutely evidence that it isn't there. How good that evidence is depends on how thorough your search was. A cursory search provides weak evidence; an exhaustive search provides strong evidence.
Now, back to hypothesis testing. When you run a hypothesis test, you are looking for evidence that the null hypothesis is not true. If you don't find it, then that is certainly evidence that the null hypothesis is true, but how strong is that evidence? To know that, you have to know how likely it is that evidence that would have made you reject the null hypothesis could have eluded your search. That is, what is the probability of a false negative on your test? This is the Type II error rate of the test, $\beta$, the complement of its power, $1-\beta$.
Now, the power of the test, and therefore the false negative rate, usually depends on the size of the effect you are looking for. Large effects are easier to detect than small ones. Therefore, there is no single $\beta$ for an experiment, and therefore no definitive answer to the question of how strong the evidence for the null hypothesis is. Put another way, there is always some effect size small enough that it's not ruled out by the experiment.
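To make this concrete, here's a short sketch using base R's power.t.test with the n = 99 one-sample design from earlier in the thread (the candidate effect sizes are arbitrary):

```r
# Power (1 - beta) of a one-sample t-test with n = 99 and alpha = .05
# at several effect sizes (in sd units): small effects leave a high
# false negative rate, large ones a low one.
sapply(c(.1, .2, .5, .8), function(d)
  power.t.test(n = 99, delta = d, sd = 1, sig.level = .05,
               type = "one.sample")$power)
```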
From here, there are two ways to proceed. Sometimes you know you don't care about an effect size smaller than some threshold. In that case, you probably should reframe your experiment such that the null hypothesis is that the effect is above that threshold, and then test the alternative hypothesis that the effect is below the threshold. Alternatively, you could use your results to set bounds on the believable size of the effect. Your conclusion would be that the size of the effect lies in some interval, with some probability. That approach is just a small step away from a Bayesian treatment, which you might want to learn more about, if you frequently find yourself in this sort of situation.
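A minimal sketch of the first option, with an illustrative threshold of .2: the reframed null is that the effect is at least that large, and rejecting it supports the conclusion that the effect is below the threshold.

```r
# Reframed null: the true mean is at least .2 (the smallest effect
# deemed to matter). A small p-value here supports "effect below
# the threshold" -- effectively one half of a TOST procedure.
set.seed(8)
t.test(rnorm(999), mu = .2, alternative = "less")$p.value
```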
There's a nice answer to a related question that touches on evidence of absence testing, which you might find useful.