Solved – Why does the central limit theorem work with a single sample

central limit theorem, sampling

I have always been taught that the CLT works when you have repeated sampling, with each sample being large enough. For example, imagine I have a country of 1,000,000 citizens. My understanding of CLT is that even if the distribution of their heights was not normal, if I took 1000 samples of 50 people (i.e. conduct 1000 surveys of 50 citizens each), then calculated their mean height for each sample, the distribution of these sample means would be normal.

However, I have never seen a real-world case where researchers took repeated samples. Instead, they take one big sample (e.g. survey 50,000 citizens about their height) and work from that.

Why do statistics books teach repeated sampling when real-world researchers only take a single sample?

Edit: The real world case I am thinking about is doing statistics on a dataset of 50,000 twitter users. That dataset obviously isn't repeated samples, it is just one big sample of 50,000.

Best Answer

The CLT (at least in some of its various forms) tells us that, in the limit as $n\to\infty$, the distribution of a single standardized sample mean ($\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$) converges to a normal distribution (under some conditions).

The CLT does not tell us what happens at $n=50$ or $n=50{,}000$.

But in attempting to motivate the CLT, particularly when no proof of the CLT is offered, some people rely on the sampling distribution of $\bar{X}$ for finite samples and show that, as larger samples are taken, the sampling distribution gets closer to the normal.

Strictly speaking, this isn't demonstrating the CLT; it's nearer to demonstrating the Berry-Esseen theorem, since it shows something about the rate at which the approach to normality occurs. That in turn would lead us to the CLT, so it serves well enough as motivation. In fact, something like the Berry-Esseen theorem often comes closer to what people actually want to use in finite samples anyway, so that motivation may in some sense be more useful in practice than the central limit theorem itself.
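For reference, one common form of the Berry-Esseen bound can be written as follows. The notation is introduced here rather than taken from the discussion above: the observations are assumed i.i.d. with mean $\mu$, variance $\sigma^2$, and finite third absolute moment $\rho = E|X-\mu|^3$; $F_n$ denotes the distribution function of the standardized sample mean, and $\Phi$ the standard normal distribution function.

$$
\sup_x \left| F_n(x) - \Phi(x) \right| \;\le\; \frac{C\,\rho}{\sigma^3 \sqrt{n}}
$$

Here $C$ is a universal constant (known to be below 0.5). The bound shrinks like $1/\sqrt{n}$, but it scales with $\rho/\sigma^3$, which can be arbitrarily large for heavily skewed distributions.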

> the distribution of these sample means would be normal.

Well, no: they would be non-normal, but in practice they would be very close to normal (heights are somewhat skewed, but not strongly so).

[Note again that the CLT really tells us nothing about the behavior of sample means for $n=50$; this is what I was getting at with my earlier discussion of Berry-Esseen, which does deal with how far from a normal cdf the distribution function of standardized means can be for finite samples]
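The "close to normal, but not exactly normal" point at $n=50$ can be checked with a small simulation. This is only a sketch under stated assumptions: the population is taken to be exponential with mean 1, a hypothetical stand-in that is far more skewed than real heights, and the function names `sample_skewness` and `simulate_sample_means` are made up for illustration. For this particular population the skewness of the sample mean is $2/\sqrt{n}$, so it is small at $n=50$ but not zero.

```python
import math
import random
import statistics


def sample_skewness(values):
    """Moment-based sample skewness (near 0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` independent samples of size `n` and return their means.

    Assumption: the population is exponential with mean 1, chosen only
    because it is clearly skewed; it is not a model of heights.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]


if __name__ == "__main__":
    means = simulate_sample_means(n=50, reps=20_000)
    print("skewness of simulated sample means:", round(sample_skewness(means), 3))
    print("theoretical skewness 2/sqrt(n):   ", round(2 / math.sqrt(50), 3))
```

A normal distribution has skewness 0, so a clearly positive value here shows the sample means are still slightly skewed at $n=50$, even though a histogram of them would look roughly bell-shaped.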

> The real world case I am thinking about is doing statistics on a dataset of 50,000 twitter users. That dataset obviously isn't repeated samples, it is just one big sample of 50,000.

For many distributions, the mean of a sample of 50,000 items would have very close to a normal distribution -- but even at $n=50{,}000$ that is not guaranteed (if the distribution of the individual items is sufficiently skewed, for example, the distribution of sample means may still be skewed enough to make a normal approximation untenable).

(The Berry-Esseen theorem would lead us to anticipate that exactly that problem might occur -- and demonstrably, it does. It's easy to give examples to which the CLT applies but for which n=50,000 is not nearly a large enough sample for the standardized sample mean to be close to normal.)
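One such example, as a sketch: suppose each item is a 0/1 indicator of a rare event with probability $p = 10^{-5}$ (a hypothetical value chosen for illustration, as is the function name `bernoulli_mean_diagnostics`). The variance is finite, so the CLT applies, yet at $n=50{,}000$ the expected number of events is only 0.5. The exact quantities below need no simulation.

```python
import math


def bernoulli_mean_diagnostics(n, p):
    """Exact vs. normal-approximation facts for the mean of n Bernoulli(p) draws."""
    # Exact probability that all n observations are 0, so the sample mean is exactly 0.
    prob_mean_is_zero = (1.0 - p) ** n
    # Exact skewness of the sample mean (same as that of a Binomial(n, p) count).
    skewness = (1.0 - 2.0 * p) / math.sqrt(n * p * (1.0 - p))
    # Normal approximation to P(sample mean <= 0), using the exact mean and variance.
    z = (0.0 - n * p) / math.sqrt(n * p * (1.0 - p))
    normal_prob_at_or_below_zero = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return prob_mean_is_zero, skewness, normal_prob_at_or_below_zero


if __name__ == "__main__":
    # Assumed values: n matches the question; p = 1e-5 is hypothetical.
    exact, skew, approx = bernoulli_mean_diagnostics(n=50_000, p=1e-5)
    print("exact P(sample mean = 0):      ", round(exact, 3))
    print("normal approx. P(mean <= 0):   ", round(approx, 3))
    print("skewness of the sample mean:   ", round(skew, 3))
```

With these assumed values, the exact probability of a zero sample mean is roughly 0.61, while the normal approximation puts only roughly 0.24 at or below zero, and the skewness of the sample mean is about 1.4. A normal approximation is clearly untenable here despite the large sample.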