[Math] Is sample variance always less than or equal to population variance

python statistics

I was reading this Wikipedia article on Bessel's correction: https://en.wikipedia.org/wiki/Bessel%27s_correction. The article says that sample variance is always less than or equal to population variance when sample variance is calculated using the sample mean. However, if I create a numpy array containing 100,000 random normal data points, calculate its variance, and then draw 1000-element samples from that array, I find that many of my samples have a higher variance than the 100,000-element population.

import numpy as np

rand_norm = np.random.normal(size=100000)

# save the population variance
pop_var = np.var(rand_norm, ddof=0)

# run the next two statements a few times; some samples turn out to have
# a variance that is higher than the variance of rand_norm
samp = np.random.choice(rand_norm, 1000, replace=True)

# calculate the sample variance without correcting the bias (ddof=0)
# I thought that the variance would always be less than or equal to pop_var.
print(np.var(samp, ddof=0), pop_var)

Why am I getting sample variance which is greater than the population variance?

Best Answer

You have misinterpreted the article. The passage you are looking at never says anything about the actual population variance.

The passage literally says:

Now a question arises: is the estimate of the population variance that arises in this way using the sample mean always smaller than what we would get if we used the population mean?

The pronoun "what" refers to an estimate of the population variance. To spell it out more explicitly, the article compares two ways of estimating the population variance:

  1. Subtract the sample mean from each observed value in the sample. Take the square of each difference. Add the squares. Divide by the number of observations.

  2. Subtract the population mean from each observed value in the sample. Take the square of each difference. Add the squares. Divide by the number of observations.

The article then says that Method 1 always gives a smaller result except in the case where the sample mean happens to be exactly the same as the population mean, in which case both methods give the same result.
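As a rough check of the claim above, the two methods can be compared on the same kind of data the question uses. This is only a sketch: the helper names `method_1` and `method_2`, the seed, and the use of `np.random.default_rng` (instead of the `np.random.normal` call in the question) are choices made for this illustration, not something taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

population = rng.normal(size=100_000)
pop_mean = population.mean()

def method_1(sample):
    """Mean squared deviation from the sample's own mean (np.var with ddof=0)."""
    return np.mean((sample - sample.mean()) ** 2)

def method_2(sample, known_mean):
    """Mean squared deviation from the population mean."""
    return np.mean((sample - known_mean) ** 2)

results = []
for _ in range(1000):
    sample = rng.choice(population, size=1000, replace=True)
    results.append((method_1(sample), method_2(sample, pop_mean)))

method_1_vals, method_2_vals = map(np.array, zip(*results))

# Method 1 never exceeds Method 2 (the small tolerance absorbs rounding error)
print(np.all(method_1_vals <= method_2_vals + 1e-12))
```

The comparison here is between the two estimates only; neither one is compared with `np.var(population)`.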

This is a simple consequence of the not-so-simple fact that if you take any finite list of numbers $(x_1, x_2, \ldots, x_n)$ and consider the function $f(m)$ defined by $$ f(m) = (x_1 - m)^2 + (x_2 - m)^2 + \cdots + (x_n - m)^2, $$

the smallest value of $f(m)$ occurs when $m$ is the mean of that list of numbers, that is, when $m$ is the sample mean.
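One way to see this, writing $\bar{x}$ for the mean of the list (notation introduced here, not in the article), is to add and subtract $\bar{x}$ inside each square:

$$ f(m) = \sum_{i=1}^{n} \bigl((x_i - \bar{x}) + (\bar{x} - m)\bigr)^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + 2(\bar{x} - m)\sum_{i=1}^{n} (x_i - \bar{x}) + n(\bar{x} - m)^2. $$

The middle term vanishes because $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, so

$$ f(m) = f(\bar{x}) + n(\bar{x} - m)^2 \ge f(\bar{x}), $$

with equality only when $m = \bar{x}$. Dividing by $n$ and taking $m$ to be the population mean shows that Method 2 exceeds Method 1 by exactly the squared difference between the sample mean and the population mean.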

Notice that none of the preceding statements compared anything with the actual population variance. The actual population variance could be unknown. All the above statements are concerned only with estimates of the variance.

None of this means that every sample will underestimate the population variance. We might draw a sample in which the data values are unusually far from the population mean. But in that case the variance we would compute using Method 2 would overestimate the variance by even more than Method 1. And while this can happen, it is not the usual thing to happen. More often the variance computed by Method 2 is nearly correct or smaller than the true population variance, and the variance computed by Method 1 is simply too small. Averaged over many samples of $n$ independent observations, Method 2 gives the population variance, while Method 1 gives only $(n-1)/n$ times the population variance; this is the bias that Bessel's correction removes.
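A small simulation can make the difference between "always" and "on average" concrete. The sketch below assumes samples of size `n = 10` drawn with replacement (a small size chosen here so that the bias is easy to see) and an arbitrary seed; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, only for reproducibility

population = rng.normal(size=100_000)
pop_mean = population.mean()
pop_var = population.var(ddof=0)

n = 10           # sample size (assumed small to make the bias visible)
trials = 20_000  # number of samples drawn

# Each row is one sample of size n, drawn with replacement
samples = rng.choice(population, size=(trials, n), replace=True)

# Method 1: deviations from each sample's own mean
m1 = samples.var(axis=1, ddof=0)
# Method 2: deviations from the population mean
m2 = np.mean((samples - pop_mean) ** 2, axis=1)

print("mean of Method 1 / pop_var:", m1.mean() / pop_var)  # expected near (n-1)/n
print("mean of Method 2 / pop_var:", m2.mean() / pop_var)  # expected near 1
print("fraction of samples with Method 1 > pop_var:", np.mean(m1 > pop_var))
```

A noticeable fraction of individual samples has a Method 1 variance above `pop_var`, which is what the question observed, even though the average of Method 1 falls below it.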

That's the thing about statistics like this. You can use a bad method and yet it sometimes will give you a correct result just by luck.
