Solved – Estimating the Population Mean with the Sample Mean

estimationmeanpopulationsamplesample-size

I am trying to better understand how the sample mean can be used to estimate the population mean. Using the R language, suppose I have the following population:

library(dplyr)
set.seed(123)
pop = rnorm(100000,5,5)
i = 1
population = data.frame(i,pop)

The mean of this population is:

 mean(population$pop)
[1] 4.985157

I take random (small) samples from this population

sample_1 <- population %>% sample_frac(0.01)
mean(sample_1$pop)

4.875

sample_2 <- population %>% sample_frac(0.01)
mean(sample_2$pop)

4.569

sample_3 <- population %>% sample_frac(0.01)
mean(sample_3$pop)

5.13

My Question: All these sample mean estimates, even though they are very small – are so close to the population mean! Is this how sampling works? 100 observations from a 100000 observation population is enough to get a good estimate of the mean?

Thanks!

Best Answer

Let $X_{1},X_{2},\dots ,X_{n}$ are $n$ random samples drawn from a population with overall mean $\mu$ and finite variance $\sigma ^{2}$ and if $\bar {X}_{n}$ is the sample mean, then

We know sample mean (statistic) is an unbiased estimator of the population mean (parameter) i.e., $E[\bar{X_n}]=\mu$
By SLLN we have $\bar{X_n}\overset{a.s.}{\rightarrow}\mu$ and WLLN we have $\bar{X_n}\overset{P}{\rightarrow}\mu$, when $n \to \infty$
By CLT, $\dfrac{\bar{X_n}-\mu}{\sigma/\sqrt{n}}\overset{D}{\rightarrow}N(0,1)$, where a rule of thumb is sample size $n \geq 30$
We can compute the $(1-\alpha)\%$ confidence interval for the population mean by $\bar{X_n}\pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$

For example, with the following R code snippet we can construct a $95\%$ confidence interval for the population mean:

sigma <- 5
n <- length(sample_1$pop)
x_bar <- mean(sample_1$pop)
# 95% CI
c(x_bar - qnorm(0.975)*sigma/sqrt(n), x_bar + qnorm(0.975)*sigma/sqrt(n))
# [1] 4.804931 5.424726

The following animation shows how the sampling distribution changes when the sample size gets larger:

The next animation shows how the confidence interval changes for different samples and if you repeat drawing random samples (with replacement) for a long time, there is $(1-\alpha)\%$ chance of the population mean falling inside the $(1-\alpha)\%$ confidence interval.

As can be seen from above, since the population is normally distributed, an observation has low probability to have value far away from the population mean (e.g., there are less than $5\%$ points with more than $2$ standard deviations away from mean), that's why even when the sample size is small, there is a very low probability that an observation in population far away from the population mean will be chosen in the sample, which keeps the sample mean close enough to population mean and the confidence interval constructed around the point estimate (sample mean) almost always contains the population mean. This will not be the case (in general), particularly for the population that has a distribution with fat tails.

The next animation shows how the length of the $95\%$ confidence interval decreases as we have larger sample size:

Related Solutions

Solved – Expected standard deviation for a sample from a uniform distribution

The integration is difficult even with as few as $3$ values. Why not estimate the bias in the sample SD by using a surrogate measure of spread? One set of choices is afforded by differences in the order statistics.

Consider, for instance, Tukey's H-spread. For a data set of $n$ values, let $m = \lfloor\frac{n+1}{2}\rfloor$ and set $h = \frac{m+1}{2}$. Let $n$ be such that $h$ is integral; values $n = 4i+1$ will work. In these cases the H-spread is the difference between the $h^\text{th}$ highest value $y$ and $h^\text{th}$ lowest value $x$. (For large $n$ it will be very close to the interquartile range.) The beauty of using the H-spread is that, being based on order statistics, its distribution can be obtained analytically, because the joint PDF of the $j,k$ order statistics $(x,y)$ is proportional to

$$x^{j-1}(1-y)^{n-k}(y-x)^{k-j-1},\ 0\le x\le y\le 1.$$

From this we can obtain the expectation of $y-x$ as

$$s(n; j,k) = \mathbb{E}(y-x) = \frac{k-j}{n+1}.$$

Set $j=h$ and $k=n+1-h$ for the H-spread itself. When $n=4i+1$, $j=i+1$ and $k=3i+1$, whence $s(4i+1; i+1, 3i+1)=\frac{2i}{4i+1}.$

At this point, consider regressing simulated (or even calculated values) of the expected SD ($sd(n)$) against the H-spreads $s(4i+1,i+1,3i+1) = s(n).$ We might expect to find an asymptotic series for $sd(n)/s(n)$ in negative powers of $n$:

$$sd(n)/s(n) = \alpha_0 + \alpha_1 n^{-1} + \alpha_2 n^{-2} + \cdots.$$

By spending two minutes to simulate values of $sd(n)$ and regressing them against computed values of $s(n)$ in the range $5\le n\le 401$ (at which point the bias becomes very small), I find that $\alpha_0 \approx 0.5774$ (which estimates $2\sqrt{1/12}\approx 0.57735$), $\alpha_1\approx 1.091,$ and $\alpha_2 \approx 1.$ The fit is excellent. For instance, basing the regression on the cases $n\ge 9$ and extrapolating down to $n=5$ is a pretty severe test and this fit passes with flying colors. I expect it to give four significant figures of accuracy for all $n\ge 5$.

#
# Expected spread of the j and kth order statistics (k > j) in n
# iid uniform values.
#
sd.r <- function(n,j,k) (k-j)/(n+1)
#
# Expected sd of n iid uniform values.
#
sim <- function(n, effort=10^6) {
  x <- matrix(runif(n * ceiling(effort/n)), ncol=n)
  y <- apply(x, 1, sd)
  mean(y)
}
#
# Study the relationship between sd.r and sim.
#
i <- c(1:7, 9, 15, 30, 300)
system.time({
  d <- replicate(9, t(sapply(i, function(i) c(4*i+1, sim(4*i+1), i))))
})
#
# Plot the results.
#
data <- as.data.frame(matrix(aperm(d, c(2,1,3)), ncol=3, byrow=TRUE))
colnames(data) <- c("n", "y", "i")
data$x <- with(data, sd.r(4*i+1,i+1,3*i+1))

plot(subset(data, select=c(x,y)), col="Gray", cex=1.2,
     xlab="Expected H-spread", ylab="Expected SD (via simulation)")

fit <- lm(y ~ x + I(x/n) + I(x/n^2) - 1, data=subset(data, n > 5))
j <- seq(1, 1000, by=1/4)
x <- sd.r(4*j+1, j+1, 3*j+1)
y <- cbind(x,  x/(4*j+1), x/(4*j+1)^2) %*% coef(fit)
lines(x[-(1:4)], y[-(1:4)], col="#606060", lwd=2, lty=2)
lines(x[(1:5)], y[(1:5)], col="#b0b0b0", lwd=2, lty=3)

points(subset(data, select=c(x,y)), col=rainbow(length(i)), pch=19)
#
# Report the fit.
#
summary(fit)
par(mfrow=c(2,2))
plot(fit)
par(mfrow=c(1,1))
#
# The fit based on all the data.
#
summary(fit <- lm(y ~ x + I(x/n) + I(x/n^2) - 1, data=data))
# 
# An alternative fit (fixing alpha_0).
#
summary(fit <- lm((y - sqrt(1/12))/x ~ I(1/n) + I(1/n^2) + I(1/n^3) - 1, data=data))

Solved – Mean of the bootstrap sample vs statistic of the sample

Let's generalize, so as to focus on the crux of the matter. I will spell out the tiniest details so as to leave no doubts. The analysis requires only the following:

The arithmetic mean of a set of numbers $z_1, \ldots, z_m$ is defined to be

$$\frac{1}{m}\left(z_1 + \cdots + z_m\right).$$
Expectation is a linear operator. That is, when $Z_i, i=1,\ldots,m$ are random variables and $\alpha_i$ are numbers, then the expectation of a linear combination is the linear combination of the expectations,

$$\mathbb{E}\left(\alpha_1 Z_1 + \cdots + \alpha_m Z_m\right) = \alpha_1 \mathbb{E}(Z_1) + \cdots + \alpha_m\mathbb{E}(Z_m).$$

Let $B$ be a sample $(B_1, \ldots, B_k)$ obtained from a dataset $x = (x_1, \ldots, x_n)$ by taking $k$ elements uniformly from $x$ with replacement. Let $m(B)$ be the arithmetic mean of $B$. This is a random variable. Then

$$\mathbb{E}(m(B)) = \mathbb{E}\left(\frac{1}{k}\left(B_1+\cdots+B_k\right)\right) = \frac{1}{k}\left(\mathbb{E}(B_1) + \cdots + \mathbb{E}(B_k)\right)$$

follows by linearity of expectation. Since the elements of $B$ are all obtained in the same fashion, they all have the same expectation, $b$ say:

$$\mathbb{E}(B_1) = \cdots = \mathbb{E}(B_k) = b.$$

This simplifies the foregoing to

$$\mathbb{E}(m(B)) = \frac{1}{k}\left(b + b + \cdots + b\right) = \frac{1}{k}\left(k b\right) = b.$$

By definition, the expectation is the probability-weighted sum of values. Since each value of $X$ is assumed to have an equal chance of $1/n$ of being selected,

$$\mathbb{E}(m(B)) = b = \mathbb{E}(B_1) = \frac{1}{n}x_1 + \cdots + \frac{1}{n}x_n = \frac{1}{n}\left(x_1 + \cdots + x_n\right) = \bar x,$$

the arithmetic mean of the data.

To answer the question, if one uses the data mean $\bar x$ to estimate the population mean, then the bootstrap mean (which is the case $k=n$) also equals $\bar x$, and therefore is identical as an estimator of the population mean.

For statistics that are not linear functions of the data, the same result does not necessarily hold. However, it would be wrong simply to substitute the bootstrap mean for the statistic's value on the data: that is not how bootstrapping works. Instead, by comparing the bootstrap mean to the data statistic we obtain information about the bias of the statistic. This can be used to adjust the original statistic to remove the bias. As such, the bias-corrected estimate thereby becomes an algebraic combination of the original statistic and the bootstrap mean. For more information, look up "BCa" (bias-corrected and accelerated bootstrap) and "ABC". Wikipedia provides some references.

Best Answer

Related Solutions

Solved – Expected standard deviation for a sample from a uniform distribution

Solved – Mean of the bootstrap sample vs statistic of the sample

Related Question