Solved – Estimating the Population Mean with the Sample Mean

estimation, mean, population, sample, sample-size

I am trying to better understand how the sample mean can be used to estimate the population mean. Using the R language, suppose I have the following population:

library(dplyr)
set.seed(123)
pop = rnorm(100000,5,5)
i = 1
population = data.frame(i,pop)

The mean of this population is:

 mean(population$pop)
[1] 4.985157

I take random (small) samples from this population:

sample_1 <- population %>% sample_frac(0.01)
mean(sample_1$pop)

4.875

sample_2 <- population %>% sample_frac(0.01)
mean(sample_2$pop)

4.569

sample_3 <- population %>% sample_frac(0.01)
mean(sample_3$pop)

5.13

My question: all of these sample means are very close to the population mean, even though each sample is very small relative to the population. Is this how sampling works? Are 1,000 observations (1% of a 100,000-observation population) enough to get a good estimate of the mean?

Thanks!

Best Answer

Let $X_{1},X_{2},\dots ,X_{n}$ be $n$ independent random draws from a population with mean $\mu$ and finite variance $\sigma ^{2}$, and let $\bar{X}_{n}$ be the sample mean. Then:

  • The sample mean (a statistic) is an unbiased estimator of the population mean (a parameter), i.e., $E[\bar{X}_n]=\mu$.
  • By the strong law of large numbers (SLLN), $\bar{X}_n\overset{a.s.}{\rightarrow}\mu$, and by the weak law of large numbers (WLLN), $\bar{X}_n\overset{P}{\rightarrow}\mu$, as $n \to \infty$.
  • By the central limit theorem (CLT), $\dfrac{\bar{X}_n-\mu}{\sigma/\sqrt{n}}\overset{D}{\rightarrow}N(0,1)$ as $n \to \infty$; a common rule of thumb is that the normal approximation is adequate for sample sizes $n \geq 30$.
  • When $\sigma$ is known, a $100(1-\alpha)\%$ confidence interval for the population mean is $\bar{X}_n\pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$.
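To see why a sample of this size already lands close to $\mu$, it may help to write out the standard error that appears in the CLT and in the confidence interval. The short derivation below relies on the draws being independent; the values $\sigma = 5$ and $n = 1000$ are taken from the question's setup (`rnorm(100000,5,5)` and `sample_frac(0.01)` on 100000 rows).

```latex
\operatorname{Var}(\bar{X}_n)
  = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{\sigma^{2}}{n},
\qquad
\operatorname{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}
  = \frac{5}{\sqrt{1000}} \approx 0.158 .
```

So a typical sample mean is within a few tenths of $\mu$: the half-width of the $95\%$ interval is about $1.96 \times 0.158 \approx 0.31$. The population size does not enter this formula; what matters is the sample size $n$ and the spread $\sigma$.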

For example, the following R snippet constructs a $95\%$ confidence interval for the population mean, treating $\sigma = 5$ as known:

sigma <- 5
n <- length(sample_1$pop)
x_bar <- mean(sample_1$pop)
# 95% CI
c(x_bar - qnorm(0.975)*sigma/sqrt(n), x_bar + qnorm(0.975)*sigma/sqrt(n))
# [1] 4.804931 5.424726

The following animation shows how the sampling distribution changes when the sample size gets larger:

[Animation: sampling distribution of the sample mean as the sample size increases]
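The narrowing of the sampling distribution can also be checked numerically. The sketch below is only an illustration, not the code behind the animation: it draws directly from $N(5, 5^2)$ rather than from the stored `population` data frame, and the seed, the sample sizes, and the number of repetitions are arbitrary choices made here.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
mu <- 5; sigma <- 5              # same parameters as rnorm(100000, 5, 5)
sizes <- c(10, 100, 1000)        # sample sizes to compare (illustrative)
n_rep <- 5000                    # repeated samples per sample size

# For each sample size, draw n_rep samples and record the
# standard deviation of the resulting sample means.
empirical_se <- sapply(sizes, function(n) {
  means <- replicate(n_rep, mean(rnorm(n, mu, sigma)))
  sd(means)
})

# Theoretical standard error of the sample mean: sigma / sqrt(n)
theoretical_se <- sigma / sqrt(sizes)

print(data.frame(n = sizes, empirical_se, theoretical_se))
```

The empirical and theoretical columns should agree closely, and both shrink by a factor of about $\sqrt{10}$ each time $n$ grows tenfold.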

The next animation shows how the confidence interval changes from sample to sample. If you repeatedly draw random samples (with replacement) and construct a $100(1-\alpha)\%$ confidence interval from each one, then in the long run about $100(1-\alpha)\%$ of those intervals contain the population mean.

[Animation: confidence intervals computed from repeated random samples]
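The long-run coverage can be estimated with a small simulation. This is a minimal sketch under assumptions made here: samples are drawn directly from $N(5, 5^2)$ with $n = 1000$, $\sigma$ is treated as known, and the seed and number of repetitions are arbitrary.

```r
set.seed(2)                      # arbitrary seed, only for reproducibility
mu <- 5; sigma <- 5; n <- 1000   # population parameters and sample size
n_rep <- 5000                    # number of repeated samples (illustrative)
z <- qnorm(0.975)                # critical value for a 95% interval

# For each repetition, build the known-sigma interval and
# record whether it contains the true mean mu.
covered <- replicate(n_rep, {
  x_bar <- mean(rnorm(n, mu, sigma))
  half_width <- z * sigma / sqrt(n)
  (x_bar - half_width <= mu) && (mu <= x_bar + half_width)
})

coverage <- mean(covered)        # fraction of intervals containing mu
print(coverage)
```

The printed fraction should be close to 0.95. Any single interval either contains $\mu$ or it does not; the $95\%$ describes the procedure over many repetitions.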

  • As can be seen above, because the population is normally distributed, an individual observation is unlikely to lie far from the population mean (fewer than $5\%$ of points are more than $2$ standard deviations away from it). So even in a small sample, extreme observations are rare, which keeps the sample mean close to the population mean, and the confidence interval constructed around the point estimate (the sample mean) contains the population mean in the large majority of samples. This is not true in general, particularly for populations whose distribution has fat tails.
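The contrast with a fat-tailed population can be sketched as follows. The Cauchy distribution is an example chosen here, not one named in the answer; it is an extreme case because it has no finite mean or variance, so the laws of large numbers and the CLT stated above do not apply. The seed, sample size, and repetition count are also arbitrary.

```r
set.seed(3)                      # arbitrary seed, only for reproducibility
n <- 1000                        # sample size, as in the question
n_rep <- 2000                    # number of repeated samples (illustrative)

# Sample means from the normal population used in the question
normal_means <- replicate(n_rep, mean(rnorm(n, mean = 5, sd = 5)))

# Sample means from a Cauchy population with the same centre and scale
cauchy_means <- replicate(n_rep, mean(rcauchy(n, location = 5, scale = 5)))

# Compare spreads with the interquartile range, which stays finite
# and stable even when the variance does not exist.
spread <- c(normal = IQR(normal_means), cauchy = IQR(cauchy_means))
print(spread)
```

The normal sample means cluster tightly around 5, while the Cauchy sample means remain widely scattered: the mean of $n$ Cauchy draws has the same distribution as a single draw, so increasing $n$ does not help.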

The next animation shows how the length of the $95\%$ confidence interval decreases as the sample size increases:

[Animation: length of the $95\%$ confidence interval for increasing sample sizes]
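With $\sigma$ known, the interval length is $2 z_{\alpha/2}\sigma/\sqrt{n}$, so it can be tabulated directly without any sampling. The sample sizes below are arbitrary illustrative choices; $\sigma = 5$ comes from the question's population.

```r
sigma <- 5                       # known population standard deviation
z <- qnorm(0.975)                # critical value for a 95% interval
sizes <- c(10, 100, 1000, 10000) # sample sizes to compare (illustrative)

# Full length of the known-sigma 95% interval for each sample size
ci_length <- 2 * z * sigma / sqrt(sizes)

print(data.frame(n = sizes, ci_length = round(ci_length, 3)))
```

For $n = 1000$ the length is about $0.62$, which matches the width of the interval computed from `sample_1` above. Halving the length requires four times as many observations.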
