Solved – Sampling with or without replacement

sampling

I don't know a lot about sampling methods.

I have a large population of size 2,000,000. I used one of those sample size calculators. It says that I need a sample size of approximately 10,000.

I am trying to find the probability p of success for the population. It is not feasible for me to test all 2,000,000 members of the population. That is why I am sampling.

I assume that the sample size calculator means a simple random sample without replacement. I have read that a simple random sample with replacement ensures that the covariance between any two draws is 0, i.e., that the draws are independent.

When should one choose with replacement instead of without replacement?

If we sample with replacement, then we are simply performing Bernoulli trials. I suppose this makes applying statistical tools (which?) easier.

Again, sampling ignoramus here.

Best Answer

From a finite population perspective, the difference in variances of the sample means or totals obtained via sampling with replacement (SRSWR) and sampling without replacement (SRSWOR) is captured by the finite population correction (FPC): $$ \mathbb{V}_{\rm SRSWOR}[\bar y] = \Bigl( 1 - \frac{n}{N}\Bigr) \mathbb{V}_{\rm SRSWR}[\bar y] $$ where $n$ is the sample size, $N$ is the population size, and the FPC is the factor in parentheses. For your problem, the FPC = 1 - 10,000/2,000,000 = 1 - 1/200 = 0.995, and frankly I would not bother chasing that factor down; I would treat it as being equal to 1. I typically tell my students to start keeping track of the FPC when the sampling fraction $n/N \ge 0.1$.
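To make the arithmetic above concrete, here is a minimal Python sketch that computes the FPC and shows how little it changes the standard error of a sample proportion for these numbers. The function names `fpc` and `se_proportion` and the value `p = 0.5` are illustrative assumptions (a worst-case guess for the unknown success probability), not something taken from the question.

```python
import math
from typing import Optional


def fpc(n: int, N: int) -> float:
    """Finite population correction 1 - n/N, as in the formula above."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    return 1.0 - n / N


def se_proportion(p: float, n: int, N: Optional[int] = None) -> float:
    """Standard error of a sample proportion.

    Without N this is the with-replacement (Bernoulli trials) value
    sqrt(p(1-p)/n); with N the variance is multiplied by the FPC.
    """
    variance = p * (1.0 - p) / n
    if N is not None:
        variance *= fpc(n, N)
    return math.sqrt(variance)


n, N = 10_000, 2_000_000
p = 0.5  # hypothetical worst-case value, assumed for illustration only

print("FPC:            ", fpc(n, N))
print("SE, with repl.: ", se_proportion(p, n))
print("SE, w/o repl.:  ", se_proportion(p, n, N))
```

The two standard errors differ by a factor of $\sqrt{0.995} \approx 0.9975$, i.e., by about a quarter of a percent, which is why ignoring the FPC is harmless here.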

Sometimes, the decision between SRSWOR and SRSWR is a matter of logistics, i.e., it depends on how easy it is to organize one or the other. A simple method to draw an SRSWOR is to assign a random number $U_i \sim \mbox{i.i.d. } U[0,1]$ to every record $i=1,\ldots,N$, sort by $U_i$ and take the first $n$ entries. A simple method to draw an SRSWR is to produce $n$ random numbers $V_j \sim \mbox{i.i.d. } U[0,1]$ and take units with indices $\{ [N V_j+1], j=1, \ldots, n \}$ (the brackets stand for the integer part). Depending on how your population (referred to as the frame in sampling terminology) is organized, one may be easier than the other, or neither may be feasible at all.
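The two drawing methods just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame is taken to be the integers $1,\ldots,N$, the function names `srswor` and `srswr` are made up for this example, and a small $N$ is used so the demo runs quickly.

```python
import random


def srswor(N: int, n: int, rng: random.Random) -> list:
    """Without replacement: attach a U[0,1) key to every unit 1..N,
    sort by the key, and keep the first n units."""
    keyed = [(rng.random(), i) for i in range(1, N + 1)]
    keyed.sort()
    return [i for _, i in keyed[:n]]


def srswr(N: int, n: int, rng: random.Random) -> list:
    """With replacement: n independent draws, each mapped to the
    index floor(N * V) + 1, so repeats are possible."""
    return [int(N * rng.random()) + 1 for _ in range(n)]


rng = random.Random(12345)  # fixed seed, chosen arbitrarily for repeatability
N, n = 2_000, 100           # illustrative sizes, much smaller than in the question

wor = srswor(N, n, rng)
wr = srswr(N, n, rng)

print("distinct units without replacement:", len(set(wor)))
print("distinct units with replacement:   ", len(set(wr)))
```

In practice the standard library already covers both cases: `random.sample(range(1, N + 1), n)` draws without replacement and `random.choices(range(1, N + 1), k=n)` draws with replacement. The hand-written versions above are only meant to mirror the description step by step.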

The standard sampling reference I give is Lohr (2009).
