Yes, there are some simple relationships between confidence interval comparisons and hypothesis tests in a wide range of practical settings. However, in addition to verifying the CI procedures and t-test are appropriate for our data, we must check that the sample sizes are not too different and that the two sets have similar standard deviations. We also should not attempt to derive highly precise p-values from comparing two confidence intervals, but should be glad to develop effective approximations.
In trying to reconcile the two replies already given (by @John and @Brett), it helps to be mathematically explicit. A formula for a symmetric two-sided confidence interval appropriate for the setting of this question is
$$\text{CI} = m \pm \frac{t_\alpha(n) s}{\sqrt{n}}$$
where $m$ is the sample mean of $n$ independent observations, $s$ is the sample standard deviation, $2\alpha$ is the desired test size (maximum false positive rate), and $t_\alpha(n)$ is the upper $1-\alpha$ percentile of the Student t distribution with $n$ degrees of freedom. (This slight deviation from conventional notation simplifies the exposition by obviating any need to fuss over the $n$ vs $n-1$ distinction, which will be inconsequential anyway.)
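In R this interval is a one-liner; here is a small sketch (the function name `ci` is just for illustration), using the $n$-degrees-of-freedom convention just described:

```r
# Two-sided CI: m +/- t_alpha(n) * s / sqrt(n), where t_alpha(n) is the upper
# 1 - alpha percentile of a Student t distribution with n degrees of freedom
# (the convention used in the text, rather than the usual n - 1).
ci <- function(x, alpha = 0.025) {
  n <- length(x)
  mean(x) + c(-1, 1) * qt(1 - alpha, df = n) * sd(x) / sqrt(n)
}
```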
Using subscripts $1$ and $2$ to distinguish two independent sets of data for comparison, with $1$ corresponding to the larger of the two means, a non-overlap of confidence intervals is expressed by the inequality (lower confidence limit 1) $\gt$ (upper confidence limit 2); viz.,
$$m_1 - \frac{t_\alpha(n_1) s_1}{\sqrt{n_1}} \gt m_2 + \frac{t_\alpha(n_2) s_2}{\sqrt{n_2}}.$$
This can be made to look like the t-statistic of the corresponding hypothesis test (to compare the two means) with simple algebraic manipulations, yielding
$$\frac{m_1-m_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \gt \frac{s_1\sqrt{n_2}t_\alpha(n_1) + s_2\sqrt{n_1}t_\alpha(n_2)}{\sqrt{n_1 s_2^2 + n_2 s_1^2}}.$$
The left hand side is the statistic used in the hypothesis test; it is usually compared to a percentile of a Student t distribution with $n_1+n_2$ degrees of freedom: that is, to $t_\alpha(n_1+n_2)$. The right hand side is a biased weighted average of the original t distribution percentiles.
The analysis so far justifies the reply by @Brett: there appears to be no simple relationship available. However, let's probe further. I am inspired to do so because, intuitively, a non-overlap of confidence intervals ought to say something!
First, notice that this form of the hypothesis test is valid only when we expect $s_1$ and $s_2$ to be at least approximately equal. (Otherwise we face the notorious Behrens-Fisher problem and its complexities.) Upon checking the approximate equality of the $s_i$, we could then create an approximate simplification in the form
$$\frac{m_1-m_2}{s\sqrt{1/n_1 + 1/n_2}} \gt \frac{\sqrt{n_2}t_\alpha(n_1) + \sqrt{n_1}t_\alpha(n_2)}{\sqrt{n_1 + n_2}}.$$
Here, $s \approx s_1 \approx s_2$. Realistically, we should not expect this informal comparison of confidence limits to have the same size as $\alpha$. Our question then is whether there exists an $\alpha'$ such that the right hand side is (at least approximately) equal to the correct t statistic. Namely, for what $\alpha'$ is it the case that
$$t_{\alpha'}(n_1+n_2) = \frac{\sqrt{n_2}t_\alpha(n_1) + \sqrt{n_1}t_\alpha(n_2)}{\sqrt{n_1 + n_2}}\text{?}$$
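This is easy to explore numerically. Here is a small R sketch (the name `alpha.prime` is mine; degrees of freedom follow the $n$ convention above) that evaluates the right hand side and returns the implied $\alpha'$:

```r
# Effective one-sided size alpha' implied by non-overlap of two CIs of
# one-sided size alpha, using t distributions with n (not n - 1) degrees of
# freedom, as in the notation above.
alpha.prime <- function(alpha, n1, n2) {
  rhs <- (sqrt(n2) * qt(1 - alpha, df = n1) +
          sqrt(n1) * qt(1 - alpha, df = n2)) / sqrt(n1 + n2)
  pt(rhs, df = n1 + n2, lower.tail = FALSE)
}
alpha.prime(0.025, 10, 10)  # about 0.0025 for two 95% CIs with n1 = n2 = 10
```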
It turns out that for equal sample sizes, $\alpha$ and $\alpha'$ are connected (to pretty high accuracy) by a power law. For instance, here is a log-log plot of the two for the cases $n_1=n_2=2$ (lowest blue line), $n_1=n_2=5$ (middle red line), and $n_1=n_2=\infty$ (highest gold line); the middle green dashed line is an approximation described below. The straightness of these curves attests to a power law. The exponent varies with $n=n_1=n_2$, but not by much.
The answer does depend on the set $\{n_1, n_2\}$, but it is natural to wonder how much it really varies with changes in the sample sizes. In particular, we could hope that for moderate to large sample sizes (maybe $n_1 \ge 10, n_2 \ge 10$ or thereabouts) the sample size makes little difference. In this case, we could develop a quantitative way to relate $\alpha'$ to $\alpha$.
This approach turns out to work provided the sample sizes are not too different from each other. In the spirit of simplicity, I will report an omnibus formula for computing the test size $\alpha'$ corresponding to the confidence interval size $\alpha$. It is
$$\alpha' \approx e \alpha^{1.91};$$
that is,
$$\alpha' \approx \exp(1 + 1.91\log(\alpha)).$$
This formula works reasonably well in these common situations:
- Both sample sizes are close to each other, $n_1 \approx n_2$, and $\alpha$ is not too extreme ($\alpha \gt .001$ or so).
- One sample size is within about three times the other, the smallest isn't too small (roughly, greater than $10$), and again $\alpha$ is not too extreme.
- One sample size is within three times the other and $\alpha \gt .02$ or so.
The relative error (correct value divided by the approximation) in the first situation is plotted here, with the lower (blue) line showing the case $n_1=n_2=2$, the middle (red) line the case $n_1=n_2=5$, and the upper (gold) line the case $n_1=n_2=\infty$. Interpolating between the latter two, we see that the approximation is excellent for a wide range of practical values of $\alpha$ when sample sizes are moderate (around 5-50) and otherwise is reasonably good.
This is more than good enough for eyeballing a bunch of confidence intervals.
To summarize, the failure of two $2\alpha$-size confidence intervals of means to overlap is significant evidence of a difference in means at a level equal to $2e \alpha^{1.91}$, provided the two samples have approximately equal standard deviations and are approximately the same size.
I'll end with a tabulation of the approximation for common values of $2\alpha$. In the left hand column is the nominal size $2\alpha$ of the original confidence interval; in the right hand column is the actual size $2\alpha^\prime$ of the comparison of two such intervals:
$$\begin{array}{ll}
2\alpha & 2\alpha^\prime \\ \hline
0.1 &0.02\\
0.05 &0.005\\
0.01 &0.0002\\
0.005 &0.00006\\
\end{array}$$
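A quick R check reproduces these entries directly from the power-law approximation:

```r
# Nominal two-sided CI size 2*alpha versus the approximate effective size
# 2*alpha' = 2 * e * alpha^1.91 of declaring significance on non-overlap.
two.alpha <- c(0.1, 0.05, 0.01, 0.005)
data.frame(two.alpha, two.alpha.prime = 2 * exp(1) * (two.alpha / 2)^1.91)
```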
For example, when a pair of two-sided 95% CIs ($2\alpha=.05$) for samples of approximately equal sizes do not overlap, we should take the means to be significantly different, $p \lt .005$. The correct p-value (for equal sample sizes $n$) actually lies between $.0037$ ($n=2$) and $.0056$ ($n=\infty$).
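Both endpoints can be verified with the `alpha.prime` sketch given earlier (again using the $n$-degrees-of-freedom convention):

```r
# Exact two-sided sizes for non-overlapping 95% CIs with equal sample sizes.
2 * alpha.prime(0.025, 2, 2)               # n = 2: about 0.0037
2 * (1 - pnorm(sqrt(2) * qnorm(0.975)))    # n = infinity: about 0.0056
```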
This result justifies (and I hope improves upon) the reply by @John. Thus, although the previous replies appear to be in conflict, both are (in their own ways) correct.
Best Answer
It's hard to step in when you have people of the caliber of the names above commenting, but I did try to understand this the silly way... using the power of [R] to simulate mathematical problems. So I hope it sheds some light on what these uncertainty quantifications attached to the regression parameters mean - that was the question...
So from the perspective of the frequentist there is this Platonic world of absolute representation of every single individual - the population, and we are looking at the shadows on the wall of the cave - the sample. We know that no matter how much we try we'll be off, but we want to have an idea of how far we'll be from the truth.
We can play god, and pretend to create the population, where everything is perfect, and the parameters governing the relationships between variables are glimmering gold. Let's do that by establishing that the variable $x$ will be related to the variable $y$ through the equation $y = 10 + 0.4\,x$. We define the x's as `x = seq(from = 0.0001, to = 100, by = 0.0001)` (that is, $1$ million observations). The y's will therefore be calculated as `y <- 0.4 * x + 10`. We can combine these values in a data.frame: `population = data.frame(x, y)`.

From this population we will take $100$ samples. For each sample, we will randomly select $100$ rows of data from the dataset. Let's define the function for sampling rows:
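A minimal sketch of what such a sampling function might look like (the name `take_sample` is an assumption; the noise level matches the `rnorm(100, 0, 10)` used for the single sample further down):

```r
# Draw n rows at random from the population and add gaussian noise to y
# (this is where we leave paradise).
take_sample <- function(pop, n = 100) {
  S <- pop[sample(nrow(pop), n), ]
  S$y <- S$y + rnorm(n, mean = 0, sd = 10)
  S
}
```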
Notice that we are no longer in paradise - now we have noise (`rnorm`). And we are going to collect both the intercepts and the slopes (I'll call them `betas`) of the OLS linear regression run on each one of these $100$ samples. Let's write some lines of code for this:
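A sketch of those lines, assuming the `take_sample` helper above (only the vector names `intercepts` and `betas` come from the text):

```r
# Fit an OLS regression to each of the 100 noisy samples and keep the
# estimated coefficients.
intercepts <- numeric(100)
betas      <- numeric(100)
for (i in 1:100) {
  fit_i <- lm(y ~ x, data = take_sample(population, 100))
  intercepts[i] <- coef(fit_i)[1]  # estimated intercept for sample i
  betas[i]      <- coef(fit_i)[2]  # estimated slope for sample i
}
```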
And combine both into a new data.frame: `reg_lines <- data.frame(intercepts, betas)`. As expected, given the normal randomness of the noise, the histogram of the slopes will be gaussian looking. And if we plot all the regression lines that we fitted in each single one of the $100$ samples from the rows in the `population`, we'll see how any single one is just an approximation, because they do oscillate between a maximum and a minimum in both intercept and slope.

But we do live in the real world, and what we have is just a sample... just one of those multicolored lines, through which we are trying to estimate the truth (i.e. intercept of $10$ and slope of $0.4$). Let's conjure this sample:
`S <- population[sample(nrow(population), 100), ]; S$y <- S$y + rnorm(100, 0, 10)`, and its OLS regression line: `fit <- lm(y ~ x, data = S)`.

Since we are playing god, let's plot our biased sample (dark blue dots with the dark blue fitted regression line) together with the true line in solid green, and the maximum and minimum combinations of intercepts and slopes we got in our simulation (dashed red lines), giving us an idea of how far off we could possibly be from the true line.
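A base-R sketch of that picture (my own rendering of the figure described; pairing the extreme intercepts with the extreme slopes is one simple choice):

```r
# The sample (dark blue), its fitted line, the true line (green), and the
# extreme intercept/slope combinations from the simulation (dashed red).
plot(S$x, S$y, pch = 16, col = "darkblue", xlab = "x", ylab = "y")
abline(fit, col = "darkblue", lwd = 2)
abline(a = 10, b = 0.4, col = "green", lwd = 2)
abline(a = max(reg_lines$intercepts), b = max(reg_lines$betas),
       col = "red", lty = 2)
abline(a = min(reg_lines$intercepts), b = min(reg_lines$betas),
       col = "red", lty = 2)
```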
Let's quantify this possible error using a Wald interval for the slope to generate the 95% confidence interval: `coef(fit)[2] + c(-1, 1) * 1.96 * summary(fit)$coefficients[4]`, where `summary(fit)$coefficients[4]` is the calculated standard error of the estimated slope. This gives us `0.2836088` to `0.4311044` (remember the "true" value $0.4$). And for the intercept: `coef(fit)[1] + c(-1, 1) * 1.96 * summary(fit)$coefficients[3]`, which gives us `9.968347` to `17.640500`.

Finally, let's compare these values with those generated by [R] directly.
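Presumably something like R's built-in `confint` was used here; it reports both intervals at once:

```r
# 95% confidence intervals for the intercept and the slope, as computed by R
confint(fit, level = 0.95)
```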
Pretty close...
OK, so this is a very intuitive approach to seeing what the confidence intervals are trying to answer. And as for the $p$-values, you can read how they are generated here. In general, the text notes that if the regression coefficient in the population is $0$ ($H_0: \beta = 0$), the $t$-statistic will be:
$$t = \frac{\hat\beta_{yx}-\beta_{yx}}{SE_{\hat\beta}} = \frac{\hat\beta_{yx}}{SE_{\hat\beta}}.$$
The $SE_{\hat\beta}$ (which we used in the Wald interval) can be calculated in different ways, although the formula given in the text quoted is:
$SE_{\hat\beta}=\sqrt{\frac{var(e)}{var(x) \, (N-2)}}$. If we calculate this manually:
The variance of the independent variable for our sample is `var_x <- (sd(S$x))^2`, which equals $719.0691$. The variance of the errors is `var_e <- sum((residuals(fit) - mean(residuals(fit)))^2) / (nrow(S) - 1)`, which equals $99.76605$. And $N - 2 = 98$, with `N <- nrow(S)` (we lose one $df$ each for the intercept and the slope). Hence $SE_{\hat\beta} = \small 0.03762643$ (`SE <- sqrt(var_e / (var_x * (N - 2)))`), which happily coincides with the standard error obtained for the slope of x by [R]. So $t = \frac{\hat\beta_{yx}}{SE_{\hat\beta}} = \small 0.3573566 / 0.03762643 = 9.497488$ (`t <- coef(fit)[2] / SE`). What else? Right, the $p$-value: `pt(9.497488, 98, lower.tail = F)`, which returns `7.460233e-16`, essentially $0$.