Statistics – How Does Accuracy of a Survey Depend on Sample Size and Population Size?

samplingstatistics

Which survey is more accurate? Assume the samples are taken perfectly randomly.

  1. A sample of 100 people out of a population of 1000 (sample is 10% of population)

  2. A sample of 1000 people out of a population of 1000000 (sample is 0.1% of population)


I remember my lecturer saying something like "when the sample size is small compared to the population, the accuracy depends almost all on the sample size, the population size is unimportant" . Is there a name for that result? It's quite surprising at first.

I'd love to see some graphs of these functions.

If it helps, here's a concrete example (made up by me).

An unknown proportion p of the population favour candidate Alice. The rest favour Bob. We take a random sample size k of the population (size n), and ask their preferences, to come up with an estimate p-hat.

How does the expected error $\mathbb{E}|\hat{p} – p|$ depend on k and n? And in the limit $n\to\infty$?

Best Answer

Note: For convenience only I use in the following $N$ for the size of the population and $n$ for the sample size.

In order to answer OPs questions we start with some preliminary work and describe the current situation in somewhat more detail.

Current situation:

Here we have a simple random sampling, meaning that every possible combination of $n$ units from a population of size $N$ is equally likely to be the sample selected.

We are in a sampling situation where the object is to estimate the proportion of units in a population having some attributes. In such a situation, the variable of interest is an indicator variable: $y_i=1$ if unit $i$ has the attribute, and $y_i=0$ if it does not.

Writing $p$ for the proportion in the population of size $N$ with the attribute \begin{align*} p=\frac{1}{N}\sum_{i=1}^{N}y_i=\mu \end{align*} the finite population variance is \begin{align*} \sigma^2&=\frac{\sum_{i=1}^{N-1}(y_i-p)^2}{N-1}=\frac{\sum_{i=1}^{N-1}y_i^2-Np^2}{N-1} =\frac{Np-Np^2}{N-1}\\ &=\frac{N}{N-1}p(1-p) \end{align*} Now letting $\hat{p}$ denote the proportion in the sample of size $n$ with the attribute \begin{align*} \hat{p}=\frac{1}{n}\sum_{i=1}^n{y_i}=\bar{y} \end{align*} the sample variance is \begin{align*} s^2&=\frac{\sum_{i=1}^{n-1}(y_i-\bar{y})^2}{n-1}=\frac{\sum_{i=1}^{n-1}y_i^2-n\hat{p}^2}{n-1}\\ &=\frac{n}{n-1}\hat{p}(1-\hat{p})\\ \end{align*}

Note the sample proportion is the sample mean of a simple random sample, it is unbiased for the population proportion and has variance \begin{align*} \mathop{var}(\hat{p})=\frac{N-n}{N-1}\cdot\frac{p(1-p)}{n}\tag{1} \end{align*}

Before we can answer OPs questions we have to do some general

Considerations regarding accuracy:

Suppose that one wishes to estimate a population parameter $\theta$ - for example the population mean or total or proportion of an attribute of the units of the population with an estimator $\hat{\theta}$. Then we would wish the estimate to be close to the true value with high probability.

So, specifying a maximum allowable difference $d$ between the estimate and the true value and allowing for a small probability $\alpha$ that the error may exceed that difference, the challenge is to choose a sample size $n$ such that \begin{align*} P(|\hat{\theta}-\theta|>d)<\alpha\tag{2} \end{align*} If the estimator $\hat{\theta}$ is an unbiased, normally distributed estimator of $\theta$, then $\frac{\hat{\theta}-\theta}{\sqrt{\mathop{var}(\hat{\theta})}}$ has a standard normal distribution. Letting $z$ denote the upper $\frac{\alpha}{2}$ point of the standard normal distribution yields \begin{align*} P\left(\frac{|\hat{\theta}-\theta|}{\sqrt{\mathop{var}(\hat{\theta})}}>z\right) =P\left(|\hat{\theta}-\theta|>z\sqrt{\mathop{var}(\hat{\theta})}\right)=\alpha \end{align*}

Now, since $d$ and the expression (2) provide us with a precise idea of accuracy, we are ready to harvest.

Observe, that the variance of the estimator $\hat{\theta}$ decreases with an increasing sample size $n$, so that the inequality above will be satisfied if we can choose $n$ large enough to make \begin{align*} z\sqrt{\mathop{var}(\hat{\theta})}\leq d\tag{3} \end{align*}

These are the relevant parameters to deal with accuracy. Next we consider

Sample size $n$ for estimating a proportion:

To obtain an estimator $\hat{p}$ having probability at least $1-\alpha$ of being no farther then $d$ from the population proportion, the sample size formula based on the normal approximation gives according to (1) and (3) \begin{align*} \mathop{var}(\hat{\theta})&=\frac{d^2}{z^2}\\ \frac{N-n}{N-1}\cdot\frac{p(1-p)}{n}&=\frac{d^2}{z^2}\\ \end{align*} We obtain by setting $n_0=\frac{z^2}{d^2}p(1-p)$ \begin{align*} n=\frac{1}{\frac{N-1}{N}\cdot\frac{1}{n_0}+\frac{1}{N}}\tag{3} \end{align*}

Note that the formula depends on the unknown population proportion $p$. Since no estimate of $p$ is available, a worst-case value of $p=\frac{1}{2}$ can be used in determining the sample size. This approach is justified since the quantity $p(1-p)$, and hence the value of $n$ assumes its maximum value when $p=\frac{1}{2}$.

Note: When $N$ is large compared with the sample size $n$ then formula (3) reduces to

\begin{align*} n&\simeq \lim\limits_{N\rightarrow \infty}\frac{1}{\frac{N-1}{N}\cdot\frac{1}{n_0}+\frac{1}{N}}=n_0 \end{align*} Since then $n=n_0$ we obtain \begin{align*} n=\frac{z^2}{d^2}p(1-p)\tag{4} \end{align*} and we see in accordance with OPs lecturer, that in case the sample size $n$ is small compared with the population size the accuracy $d$ depends on the sample only.

With regard to one of OPs questions I'm not aware of a specific term for this circumstance. But, sometimes this is named finite population correction.

Which scenario is more accurate:

To answer this question we now transform (3) to obtain the difference $d$

\begin{align*} d=z\sqrt{\frac{N-n}{(N-1)n}p(1-p)} \end{align*}

Assuming an estimation for the true proportion with probability $0.95$ ($\alpha=0.05$) and taking the worst-case probability $p=0.5$ we obtain the formula

\begin{align*} d=1.96\sqrt{\frac{N-n}{(N-1)n}\cdot\frac{1}{2}\cdot\frac{1}{2}}=0.98\sqrt{\frac{N-n}{(N-1)n}} \end{align*}

We observe in case 1: $N=1000, n=100$ \begin{align*} d=0.98\sqrt{\frac{900}{999\cdot100}}\simeq 0.0930 \end{align*} and in case 2: $N=100000, n=1000$ \begin{align*} d=0.98\sqrt{\frac{999000}{999999\cdot1090}}\simeq 0.0310 \end{align*}

and conclude, that the accuracy of case 2 is greater than that of case 1 provided the interpretation is according to the modeling above.

Note: This answer is mostly based upon Sampling, chapter 5: Estimating Proportions, Ratios and Subpopulation Means by Steven K. Thompson.

Related Question