The correct way to interpret a $95\%$ confidence interval from the t.test function in R


Suppose that in R, we did a t-test with some sample data:

> t.test(1:10, y = c(7:20))

    Welch Two Sample t-test

data:  1:10 and c(7:20)
t = -5.4349, df = 21.982, p-value = 1.855e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.052802  -4.947198
sample estimates:
mean of x mean of y 
      5.5      13.5 

The above yields a $95\%$ confidence interval. My general understanding of a $100(1-\alpha)\%$ confidence interval is that if we repeated the experiment a very large number of times, $100(1-\alpha)\%$ of the constructed intervals would contain the true value of the parameter (here, the true difference in means).

However, the above appears to be a single "point estimate" of the confidence interval. What is the correct way to interpret the range $(-11.052802, -4.947198)$ above? Thanks!

Best Answer

The Neyman-Pearson theory of Null Hypothesis Significance Testing has the goal of providing you with a decision rule which, when the null hypothesis is true, allows you to make the correct choice $95\%$ of the time. However, it cannot tell you whether the confidence interval you computed from your random sample drawn from the population (i.e., your realization $i = (-11.052802, -4.947198)$ of the random interval $I$) contains the true parameter or not: you just don't know.
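
A quick simulation makes the distinction concrete. The following is only a sketch under stated assumptions: both samples are drawn from normal populations chosen so that the true difference in means is $-8$, mirroring the example above, and we check how many realized intervals cover that true difference.

set.seed(42)
true_diff <- -8                            # assumed true difference in means
covered <- replicate(10000, {
  x <- rnorm(10, mean = 5.5, sd = 3)       # assumed population for x
  y <- rnorm(14, mean = 13.5, sd = 3)      # assumed population for y
  ci <- t.test(x, y)$conf.int              # one realization of the random interval I
  ci[1] <= true_diff && true_diff <= ci[2] # does this realization cover the truth?
})
mean(covered)                              # close to 0.95

About $95\%$ of the realized intervals contain the true difference, but for any single realization, such as yours, there is no way to tell whether it is one of them.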

Then why do you reject the null hypothesis $H_0$ in this specific case? You do it because you know that, if $H_0$ were true and you repeated this experiment a large number of times, then, following the Neyman-Pearson decision rule, which is:

  • accept $H_0$ if $i$ contains 0
  • reject $H_0$ if $i$ doesn't contain 0 (as in your case)

you would be wrong only $5\%$ of the time. Thus the decision rule is a guide to controlling your error rate in the long run.
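
Here is a minimal sketch of that long-run guarantee (the shared $N(0, 1)$ population is an assumption; any setup with equal means would do): both samples come from the same population, so $H_0$ is true, and we count how often the rule rejects.

set.seed(1)
rejections <- replicate(10000, {
  x <- rnorm(10)                  # both samples drawn from N(0, 1),
  y <- rnorm(14)                  # so H_0 (equal means) is true
  t.test(x, y)$p.value < 0.05     # reject iff the 95% CI excludes 0
})
mean(rejections)                  # close to 0.05: wrong about 5% of the time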

This is very relevant in manufacturing, for example in process quality control. If the manufacturing process is in control, you are effectively sampling repeatedly from the same population, so you expect $5\%$ of your confidence intervals not to contain the parameter of interest. Thus a process in control would raise an alarm $5\%$ of the time, which can sound strange (in practice, the $\alpha$ level used in quality control is usually much smaller than $5\%$).

You astutely asked in a comment why not compute multiple confidence intervals instead. First of all, in real life you often can't afford the luxury of repeated sampling, because of time, budget, and similar constraints. Secondly, even if you could, it wouldn't make sense to create many such intervals and try to "intersect" them: there's no principled way to do that. Instead, you can gather all your $m$ random samples of size $n$ together and build a confidence interval from the aggregated sample $\mathbf{x}=(x_{11},\dots,x_{1n},\dots,x_{m1},\dots,x_{mn})$. Since the width of a confidence interval decreases with the sample size $N$ (usually as $O(\frac{1}{\sqrt{N}})$), the resulting confidence interval will be your most accurate inference (but you still won't know with certainty whether it contains the true parameter or not).
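
The shrinking width is easy to check with a sketch (the $N(5, 2^2)$ population and the choice $m = 9$, $n = 10$ are arbitrary assumptions):

set.seed(7)
pop_mean <- 5; pop_sd <- 2                  # assumed population
one_sample <- rnorm(10, pop_mean, pop_sd)   # a single sample, n = 10
aggregated <- rnorm(90, pop_mean, pop_sd)   # m = 9 samples of size n = 10, pooled
diff(t.test(one_sample)$conf.int)           # CI width based on N = 10
diff(t.test(aggregated)$conf.int)           # about a third as wide: sqrt(90/10) = 3

The aggregated interval comes out roughly three times narrower, consistent with the $O(\frac{1}{\sqrt{N}})$ rate.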