The best way to get a good confidence interval is to use everything you know
about the population distribution. In the first three examples below, we
pretend to have successively less information about the population, so that
we can see the effect of knowledge and assumptions on confidence intervals.
Suppose I have $n = 50$ observations from a population safely assumed to be normal, sorted in increasing order, as displayed and summarized below:
sort(x)
[1] 71 72 73 74 80 82 82 83 84 86 86 87 88
[14] 90 91 91 92 93 94 94 95 95 96 96 97 97
[27] 98 99 100 101 101 101 102 103 103 105 105 106 108
[40] 108 110 110 111 111 112 112 118 121 129 131
summary(x); sd(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
71.00 88.50 97.00 97.48 105.75 131.00
[1] 13.62656
Estimating mean, knowing data are normal and knowing population SD:
If I know these data are from a normal distribution with standard deviation
$\sigma = 15,$ then a 95% confidence interval for the population mean $\mu$
is $\bar X \pm 1.96\frac{\sigma}{\sqrt{n}},$ as in your Question. This
computes to $(93.32, 101.64).$
pm = c(-1,1); mean(x) + pm*1.96*15/sqrt(50)
[1]  93.32232 101.63768
Estimating mean, assuming normality, but not knowing SD: If I know the data are normal but don't know $\sigma,$ then I can estimate it
by the sample standard deviation $S = 13.63,$ and a confidence interval based
on Student's t distribution with $\nu = 49$ degrees of freedom is
$\bar X \pm t^*\frac{S}{\sqrt{n}}.$ For our data this is the interval
$(93.61, 101.35).$
mean(x) + pm*qt(.975,49)*sd(x)/sqrt(50)
[1] 93.60737 101.35263
Estimating mean, assuming randomly sampled data: If I am not willing to assume the data are normal, I can use a nonparametric
bootstrap procedure. If I knew the distribution of $D = \bar X - \mu,$ then
I could find cut-off points $L$ and $U$ such that
$$.95 = P(L \le D = \bar X - \mu \le U) = P(\bar X - U \le \mu \le \bar X - L),$$
so that a 95% CI for $\mu$ would be of the form $(\bar X - U, \bar X - L).$
Not knowing the distribution of $D$ I can use a bootstrap procedure to get
approximate values $L^*$ and $U^*$ of $L$ and $U,$ respectively. Temporarily,
I use $\mu^*= \bar X_{\text{obs}}= 97.48$ as a proxy for $\mu.$ Also, I take
a large number $B$ of re-samples of size $n = 50$ with replacement from the
data, find $\bar X^*$ for each re-sample, and take percentiles .025 and .975
of the $B$ values $D^* = \bar X^* - \mu^*$ as $L^*$ and $U^*,$
respectively. Then, returning $\bar X_{obs}$ to its role as the mean of the data,
a 95% nonparametric bootstrap CI for $\mu$ is of the form
$(\bar X_{obs} - U^*, \bar X_{obs} - L^*).$
In the R code below, bootstrap quantities (denoted by $*$'s above) are indicated by the suffix .re.
The bootstrap CI from the run shown is $(93.78,\, 101.18).$ Because this is
a random process, subsequent runs (with different seeds) will give slightly
different CIs.
set.seed(819); B = 10000; n = 50; a.obs = mean(x)
d.re = replicate( B, mean(sample(x, n, repl=T))-a.obs )
UL = quantile(d.re, c(.975,.025))
a.obs - UL
97.5% 2.5%
93.78 101.18
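The same basic (pivot) bootstrap can be sketched in Python with NumPy, as a cross-check of the R loop above. The seed and the NumPy sampler are assumptions, so the interval will not match the R run digit-for-digit, only approximately:

```python
import numpy as np

rng = np.random.default_rng(819)  # arbitrary seed; results vary slightly by run
# The 50 observations listed above
x = np.array([71, 72, 73, 74, 80, 82, 82, 83, 84, 86, 86, 87, 88,
              90, 91, 91, 92, 93, 94, 94, 95, 95, 96, 96, 97, 97,
              98, 99, 100, 101, 101, 101, 102, 103, 103, 105, 105, 106, 108,
              108, 110, 110, 111, 111, 112, 112, 118, 121, 129, 131])
B, n = 10_000, len(x)
a_obs = x.mean()  # 97.48

# Re-sample with replacement; D* = xbar* - xbar_obs for each of B re-samples
d_re = rng.choice(x, size=(B, n), replace=True).mean(axis=1) - a_obs
U_star, L_star = np.quantile(d_re, [0.975, 0.025])

ci = (a_obs - U_star, a_obs - L_star)  # roughly (93.8, 101.2)
```

The interval is formed by subtracting the upper quantile of $D^*$ from $\bar X_{obs}$ for the lower limit, mirroring the R code's a.obs - UL.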
It is typical for nonparametric bootstrap confidence intervals to be slightly shorter than
t confidence intervals. The bootstrap CIs are based only on the data, which
do not necessarily capture information about the tails of the normal distribution; those tails are not 'heavy,' but they do extend to $\pm\infty.$ [If the data really are normal, we should use a t CI, not a nonparametric bootstrap CI.]
The data were simulated using set.seed(818); x = round(rnorm(50, 100, 15)),
so we know $\mu = 100,$ and thus that all three of the CIs above happen to
cover the true value of $\mu.$ [Of course, $\mu$ would not be known in a real-world situation.]
Estimating mean of exponential data: Suppose we have $n = 50$ observations
from an exponential population with unknown mean $\mu.$
sort(y)
[1] 1 2 7 7 8 9 10 13 14 21 21 22 23
[14] 24 28 36 45 46 49 52 54 55 60 61 64 65
[27] 66 71 89 91 96 100 128 132 146 152 152 153 159
[40] 162 167 169 174 191 203 236 286 301 456 480
summary(y); sd(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 23.25 64.50 103.14 152.75 480.00
[1] 106.5341
For exponential data, it is known that
$\bar Y/\mu \sim \mathsf{Gamma}(\text{shape}=n,\,\text{rate}=n).$ [This result
can be proved using moment generating functions.]
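A quick sketch of that MGF argument (not spelled out in the original, but standard):

```latex
% Y_1, \dots, Y_n iid Exponential with mean \mu, so each has MGF
M_{Y_i}(t) = (1 - \mu t)^{-1}, \quad t < 1/\mu,
\;\Longrightarrow\;
M_{\sum_i Y_i}(t) = (1 - \mu t)^{-n},
% which is the MGF of Gamma(shape = n, scale = \mu).
% Multiplying a Gamma variate by 1/(n\mu) divides its scale by n\mu:
\bar Y/\mu = \tfrac{1}{n\mu} \textstyle\sum_i Y_i
\sim \mathsf{Gamma}\!\left(\text{shape} = n,\ \text{scale} = \tfrac{1}{n}\right)
= \mathsf{Gamma}(\text{shape} = n,\ \text{rate} = n).
```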
Thus
$$.95 = P\left(L = 0.742 \le \frac{\bar Y}{\mu} \le U = 1.296\right)
= P\left(\frac{\bar Y}{U} < \mu < \frac{\bar Y}{L}\right),$$
so that for our data a 95% CI for $\mu$ is $(\bar Y/U,\, \bar Y/L) = (79.61, 138.96).$ [The vector y
was simulated according to $\mathsf{Exp}(\mu = 100).$]
qgamma(c(.025,.975), 50, 50)
[1] 0.7422193 1.2956120
mean(y)/qgamma(c(.975,.025), 50, 50)
[1] 79.60717 138.96163
By contrast, if we had taken the data to be normally distributed and used
a t confidence interval, the incorrect result would have been $(72.86, 133.42).$
With a sample as large as $n = 50,$ this t CI is perhaps not catastrophically
wrong -- but wrong nevertheless.
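As a cross-check (a Python/SciPy sketch, since the answer's own code is in R), both the exact gamma-pivot interval and the naive t interval can be reproduced from the listed data:

```python
import numpy as np
from scipy import stats

# The 50 exponential observations listed above
y = np.array([1, 2, 7, 7, 8, 9, 10, 13, 14, 21, 21, 22, 23,
              24, 28, 36, 45, 46, 49, 52, 54, 55, 60, 61, 64, 65,
              66, 71, 89, 91, 96, 100, 128, 132, 146, 152, 152, 153, 159,
              162, 167, 169, 174, 191, 203, 236, 286, 301, 456, 480])
n = len(y)

# Exact gamma-pivot CI: ybar/mu ~ Gamma(shape = n, rate = n)
U, L = stats.gamma.ppf([0.975, 0.025], a=n, scale=1/n)
gamma_ci = (y.mean()/U, y.mean()/L)   # about (79.61, 138.96)

# Naive t CI, the wrong model for exponential data
half = stats.t.ppf(0.975, n - 1) * y.std(ddof=1) / np.sqrt(n)
t_ci = (y.mean() - half, y.mean() + half)  # about (72.86, 133.42)
```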
Addendum per Comment: The following simulation shows that a
"95%" t CI for $\mu$ based on $n = 50$ exponential observations
covers less than 93% of the time. (With $n = 25,$ coverage is
only about 91%.) It is in that long-run sense that such t CIs are 'incorrect'.
set.seed(819); m = 10^5; n = 50; LCL = UCL = numeric(m)
for (i in 1:m) {
x = rexp(n, .01); a = mean(x); s = sd(x)
LCL[i] = a - 1.96*s/sqrt(n); UCL[i] = a + 1.96*s/sqrt(n) }
mean((LCL < 100) & (UCL > 100))
[1] 0.92803
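The same coverage experiment can be sketched in vectorized Python with NumPy (seed and sampler are assumptions, so the estimate will differ slightly from the R run):

```python
import numpy as np

rng = np.random.default_rng(819)  # arbitrary seed
m, n, mu = 100_000, 50, 100.0

# m samples of n exponential observations with mean mu, one per row
samples = rng.exponential(scale=mu, size=(m, n))
means = samples.mean(axis=1)
sds = samples.std(axis=1, ddof=1)
half = 1.96 * sds / np.sqrt(n)  # half-width of the nominal "95%" CI

coverage = np.mean((means - half < mu) & (mu < means + half))
# coverage is roughly 0.93, noticeably below the nominal 0.95
```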
Best Answer
According to the central limit theorem, $\sqrt{n}(\bar X - \mu)/\sigma$ tends to the standard normal distribution as $n$ tends to infinity. By the same token, since $s$ tends to $\sigma$ almost surely, the variable $\sqrt{n}(\bar X - \mu)/s$ tends to the standard normal distribution as well (by Slutsky's theorem).

Since the $t$-distribution is a sequence of distributions (one for each degrees-of-freedom value), not a single distribution, it is hard to define what we mean by "tends to." If we mean it in the sense of the Kolmogorov-Smirnov distance, or something along those lines, then we can indeed say that it approaches the $t$-distribution. But that is only a consequence of the fact that the $t$-distribution itself approaches the standard normal; we could say the same of any other sequence of distributions converging to the normal.

It may happen that the $t$-distribution is a better finite-sample approximation of $\sqrt{n}(\bar X - \mu)/s$ than the standard normal, and it probably is in most cases. However, both can be used in the asymptotic case.
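To illustrate this numerically (a Python/SciPy sketch, not part of the original answer): the 97.5% quantile of the $t$-distribution approaches the normal quantile $z_{0.975} \approx 1.96$ as the degrees of freedom grow, so for large $n$ the two critical values, and hence the two intervals, nearly coincide.

```python
from scipy import stats

z = stats.norm.ppf(0.975)  # about 1.95996
for df in (5, 10, 50, 200, 1000):
    t_crit = stats.t.ppf(0.975, df)
    print(df, round(t_crit, 4), round(t_crit - z, 4))
# t.ppf(0.975, 5) is about 2.571; by df = 1000 it is about 1.962
```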