Confidence Interval – Interpreting Overlapping 95% Confidence Limits

confidence interval, measurement error, t-test

I came across these two old blog posts on displayed error bars and tried to work through the result. I believe I am making a mistake somewhere, but I'm not sure where. Let me describe the scenario first, and then lay out my reasoning.

First, the scenario:

Suppose that we have a plot with the measurements of a particular quantity $x$ for two different populations, $A$ and $B$. Let us assume that $x$ is Gaussian distributed.

We find that the two measurements have means $\bar{x}_1$, $\bar{x}_2$.

We make a plot of both measurements with their $2\sigma$ confidence limits. For simplicity, let us say that both datasets have the same standard deviation $s$.

The confidence limits overlap to some extent. The posts ask: by how much can they overlap while the difference is still significant at the $\alpha = 0.05$ level?

My attempt at answering this

Let us construct the statistic $z = \frac{\bar{x}_1 - \bar{x}_2}{2 s}$.

The standard deviation of $z$ is then $\sigma_z = \sqrt{\left(\frac{\partial z}{\partial \bar{x}_1}\right)^2 s_1^2 + \left(\frac{\partial z}{\partial \bar{x}_2}\right)^2 s_2^2} \quad = \frac{1}{\sqrt{2}}$, since $s_1 = s_2 = s$.
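
Spelling out that propagation step, in case it helps (here $s_1$ and $s_2$ are treated as the uncertainties on the two means, exactly as in the formula above, which is an assumption of this question rather than a general result):

$\dfrac{\partial z}{\partial \bar{x}_1} = \dfrac{1}{2s}, \quad \dfrac{\partial z}{\partial \bar{x}_2} = -\dfrac{1}{2s} \quad \Rightarrow \quad \sigma_z^2 = \dfrac{s^2}{4s^2} + \dfrac{s^2}{4s^2} = \dfrac{1}{2}$.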

Now, we can rephrase our problem as the search for a value $z_{\star}$ such that $p(-z_{\star} \leq z \leq z_{\star} )= 0.95$, given that
$z \sim \mathcal{N}(\mu_z = 0, \sigma_z = \frac{1}{\sqrt{2}})$ — i.e. the null hypothesis is that $z$ is normally-distributed about a mean of $0$ and standard deviation $1/\sqrt{2}$.

Going to Mathematica, I find that $z_{\star} \approx 1.386$.
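
As a cross-check that does not need Mathematica, the same quantile can be computed with Python's standard library. This is only a sketch: the variable names are my own, and it assumes the two-sided 95% region, i.e. the 0.975 quantile of $\mathcal{N}(0, 1/\sqrt{2})$.

```python
from math import sqrt
from statistics import NormalDist

# Null distribution of z: mean 0, standard deviation 1/sqrt(2)
null_z = NormalDist(mu=0.0, sigma=1.0 / sqrt(2.0))

# Two-sided 95% region: p(-z_star <= z <= z_star) = 0.95,
# so z_star is the 0.975 quantile of the null distribution
z_star = null_z.inv_cdf(0.975)

print(round(z_star, 3))  # 1.386
```

Equivalently, $z_{\star}$ is the standard normal value $1.96$ divided by $\sqrt{2}$.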

To try to interpret what this means, let us now re-write $z = \frac{\bar{x}_1 - \bar{x}_2}{2s} = \frac{\Delta \bar{x}}{w}$, where $w$ is the length of the $2\sigma$ "error bars".

Statistical significance is reached when $|z| > z_{\star} \approx 1.386$, i.e. when $|\Delta \bar{x}| > 1.386 \; w$. So $|\Delta \bar{x}|$ can be as small as about $1.386 \; w$ and still be significant, which means the "error bars" can overlap substantially.

This seems at odds with the statement here that the error bars "can overlap by as much as 25% of their total length and still show a significant difference."

So: where is the gap in my reasoning? (Is it in the interpretation of the standard deviation/standard error in the $t$-test?)

(Btw, I don't think the definition of 95% CLs in these posts is technically correct, with the usual mixing up of Bayesian and Frequentist interpretations. I've tried to avoid this in my question, but let me know if I can be clearer.)

Best Answer

Had another go at this, corrected some of the framing of my query and got to the point where there can be up to $29\%$ overlap (to be defined below) between the confidence limits.

1. Re-framing problem:

We have two groups, $1$ and $2$, and we measure some quantity $Q$ for samples of size $n_1$ and $n_2$ from $1$ and $2$ respectively.

Let us assume that the $Q$ values in populations $1$ and $2$ are Gaussian distributed with (population) means $\mu_i$ and variances $\sigma_i^2$, for $i=1,2$.

We measure the $Q$ of our samples and find mean values $\bar{x}_1$, $\bar{x}_2$ and the Bessel-corrected sample variances $s_1^2$, $s_2^2$.

Following standard procedure, we calculate the $95\%$ confidence intervals for the means $\mu_i$, $[\bar{x}_i - w_i, \bar{x}_i + w_i]$.

Here is the question: by how much can these confidence intervals overlap, and still yield a statistically significant difference for the population means at the $\alpha=0.05$ level? (Apologies if there is a factor of two difference from your definition.)

2. Assumptions for tractability:

From this answer, it seems that the general case of this problem is related to the so-called Behrens-Fisher problem. I just want some heuristic feeling for what's happening, so let's assume:

  • $n_1 \approx n_2 = n$ : the sample sizes are approximately the same
  • $n \gg 1$ : the sample sizes are quite large (let me say at least $\mathcal{O}(10)$).
  • $s_1 \approx s_2 = s$ : the sample variances are approximately the same

3. Definition of overlap:

With the previous assumptions, and some foreknowledge that the (half-)widths are related to the sample standard deviations $s_i$, we can take the confidence intervals to have the same width, i.e. $[\bar{x}_i - w, \bar{x}_i + w]$.

Let us define the overlap as $r = 1 - \dfrac{|\bar{x}_1 - \bar{x}_2|}{2w}$ when $|\bar{x}_1 - \bar{x}_2| \leq 2w$, and $0$ otherwise.

Then $r$ is just the ratio of the common overlap range to the entire width $2w$.
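
To make the definition concrete, here is a minimal sketch of $r$ as a function. The name `overlap_ratio` and the argument names are my own, and it assumes both intervals share the same half-width $w$, as above.

```python
def overlap_ratio(xbar1: float, xbar2: float, w: float) -> float:
    """Fraction of the full interval width 2w shared by the two intervals."""
    if w <= 0:
        raise ValueError("half-width w must be positive")
    gap = abs(xbar1 - xbar2)
    # Intervals [xbar - w, xbar + w] stop overlapping once the means are 2w apart
    if gap > 2 * w:
        return 0.0
    return 1.0 - gap / (2 * w)
```

For example, `overlap_ratio(0.0, 1.0, 1.0)` gives `0.5`: the intervals $[-1, 1]$ and $[0, 2]$ share a range of length $1$ out of a full width of $2$.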

4. Relation of the width $w$ to $s$

(This should be boilerplate stuff, so I'll be brief) For each population $i$, we define the $t$-statistic

$t_i = \dfrac{\bar{x}_i - \mu_i}{s_i/\sqrt{n_i}} \approx \dfrac{\bar{x}_i - \mu_i}{s/\sqrt{n}}$, with $\nu_i$ degrees of freedom, $\nu_i = n_i - 1 \approx n - 1$.

We want critical values of $t_{\alpha}$ where $\mathrm{Pr}(-t_{\alpha} < t_i < t_{\alpha}) = 0.95$.

With the assumption that $n$ and hence $\nu_i$ are large, we find that $t_\alpha \approx 2$ (more like 1.96, but let's keep that in our pocket).

From the definition of $t_i$, we find that the $95\%$ CIs are then (approximately)

$[\bar{x}_i - 2 s/\sqrt{n}, \bar{x}_i + 2 s/\sqrt{n}]$, and comparing with our usage of $w$ above, we see that $w = 2s/\sqrt{n}$.
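
A small sketch of this half-width, for anyone who wants numbers. The function name `ci_half_width` is my own, and it assumes the large-$n$ limit, so it uses the normal critical value (about $1.96$) in place of the $t$ critical value or the rounded $2$.

```python
from math import sqrt
from statistics import NormalDist

def ci_half_width(s: float, n: int, level: float = 0.95) -> float:
    """Large-n half-width of the CI for a mean, using a normal critical value."""
    # Two-sided interval: about 1.96 for level = 0.95
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return crit * s / sqrt(n)

print(round(ci_half_width(1.0, 100), 3))  # 0.196
```

With $s = 1$ and $n = 100$ this gives $w \approx 0.196$, against $0.2$ from the rounded $w = 2s/\sqrt{n}$.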

5. Statistical significance of the difference

Let us define the difference $\delta = \bar{x}_1 - \bar{x}_2$. Our null hypothesis is that $\delta \sim \mathcal{N}(\mu_{\delta}, \sigma_{\delta})$, with $\mu_{\delta}=0$.

The variance of $\delta$ is $\sigma_{\delta}^2 = \dfrac{\sigma_{1}^2}{n_1} + \dfrac{\sigma_2^2}{n_2} \approx \dfrac{2s^2}{n}$.

Then we have the $t$-statistic $t_{\delta}= \dfrac{\delta}{\sqrt{2s^2/n}}$, with $\nu_{\delta} = (n_1 - 1) + (n_2 -1) \approx 2 (n-1)$ degrees of freedom.

Again, assuming $n$ is large, we can find the critical value $t_{\delta,\alpha} \approx 2$ such that $\mathrm{Pr}(-t_{\delta,\alpha} < t_{\delta} < t_{\delta,\alpha}) = 0.95$, and then invert this to find that a $95\%$ CL for $\mu_{\delta}=0$ is given by $[ -2\sqrt{2s^2/n} ,\; 2\sqrt{2s^2/n}]$.

6. Back to the overlap

If I'm not making some common fallacy here, the above implies that finding $ |\bar{x}_1 - \bar{x}_2| = | \delta | > 2\sqrt{2s^2/n}$ would be statistically significant at the $\alpha=0.05$ level.

In 4., we saw that $w = 2s/\sqrt{n}$, and using the definition of the overlap ratio in 3., the above relation then implies

$2w(1-r) = |\bar{x}_1 - \bar{x}_2| > w \sqrt{2}$, which we can rearrange to find that, if the overlap ratio

$r < 1 - \dfrac{1}{\sqrt{2}} \approx 0.29$,

then the difference between the groups is statistically significant.
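
As a sanity check, the overlap rule can be compared numerically with the direct test on $\delta$. This is a sketch under the assumptions of sections 2 to 5: equal $s$ and $n$ for both groups, large $n$, and the same normal critical value used for both the intervals and the test. The names `CRIT`, `R_MAX` and `compare_rules` are my own.

```python
from math import sqrt
from statistics import NormalDist

CRIT = NormalDist().inv_cdf(0.975)  # about 1.96; assumed for both the CIs and the test
R_MAX = 1.0 - 1.0 / sqrt(2.0)       # overlap threshold derived above, about 0.29

def compare_rules(xbar1: float, xbar2: float, s: float, n: int):
    """Return (overlap ratio, verdict from overlap rule, verdict from direct test)."""
    w = CRIT * s / sqrt(n)                # CI half-width for each mean
    gap = abs(xbar1 - xbar2)              # |delta|
    r = max(0.0, 1.0 - gap / (2.0 * w))   # overlap ratio from section 3
    by_overlap = r < R_MAX
    by_test = gap / sqrt(2.0 * s**2 / n) > CRIT
    return r, by_overlap, by_test

print(compare_rules(0.0, 0.30, 1.0, 100))  # overlap about 0.23: both rules say significant
print(compare_rules(0.0, 0.25, 1.0, 100))  # overlap about 0.36: both rules say not significant
```

Away from the exact boundary, the two verdicts agree for any inputs, since $r < 1 - 1/\sqrt{2}$ is just a rearrangement of $|\delta| > \sqrt{2}\, w$.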

I'd be happy to accept that e.g. the difference between $1.96$ and $2$ in the determination of the critical values of the $t$-distribution may be the difference between this $29\%$ value and the $25\%$ claim, but also happy for any criticism.
