Solved – Philosophy of sampling for experiments: finite versus infinite

experiment-design, sample-size, sampling

What is the difference between a finite population and an infinite one when you are designing an experiment (sample size/power analysis and interpretation of the results)?

Say a company has a database of 20,000 customers. Given that the response to some stimulus is relatively small, and a meaningful minimum detectable difference is also small, a power analysis for a two-sample proportion test may tell you that you need two groups of 15,000 for the experiment.
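As a rough illustration of where numbers that size come from, here is the kind of calculation in base R's `power.prop.test()`; the 2% baseline rate and 0.5 percentage point detectable difference are assumed values chosen only to match the scale described, not figures from the question.

```r
## Power analysis for comparing two proportions with base R's power.prop.test().
## The baseline rate and detectable difference below are assumptions for illustration.
power.prop.test(p1 = 0.02,        # assumed baseline response rate
                p2 = 0.025,       # assumed rate under the stimulus
                sig.level = 0.05,
                power = 0.80)
## => n per group comes out around 14,000, in the same ballpark as the
##    15,000 quoted above.
```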

Do you quit and say you can't experiment on this population? Or do you (somehow) treat the population as finite and run the power analysis that way instead? What are the implications?

ADD:

What I would like to know (adding this detail in case the last part of the question wasn't completely clear) is the difference between

  1. The inference assuming an infinite population, say the inference from a logistic regression model fitted with glm() in R.

  2. The inference assuming a finite population, say the inference from a logistic regression model fitted with svyglm() in R (from the survey package)?

Will #1 allow inference about the wider population / data-generating process, while #2 only allows inference about that particular population (which is assumed fixed)?
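For concreteness, here is a minimal sketch of the two fits contrasted in #1 and #2, using simulated data; the data frame, variable names, and the population size of 20,000 are assumptions for illustration only.

```r
library(survey)

set.seed(1)
N <- 20000                                  # assumed size of the finite customer population
dat <- data.frame(
  treatment = rbinom(5000, 1, 0.5),         # simulated group assignment
  response  = rbinom(5000, 1, 0.02)         # simulated (rare) binary response
)

## 1. Infinite-population, model-based inference: standard errors describe
##    uncertainty about the data-generating process.
fit_inf <- glm(response ~ treatment, family = binomial, data = dat)

## 2. Finite-population, design-based inference: supplying the population
##    size via fpc makes svyglm() apply the finite population correction,
##    shrinking standard errors as the sample size approaches N.
dat$fpc <- N
des     <- svydesign(ids = ~1, fpc = ~fpc, data = dat)
fit_fin <- svyglm(response ~ treatment, design = des, family = quasibinomial)

summary(fit_inf)
summary(fit_fin)
```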

Best Answer

You can do the power analysis assuming a finite population, and because the variance of the estimate goes to $0$ as the sample size approaches the population size, this makes a big difference. Under the infinite-population assumption, the variance of a binomial proportion is $p(1-p)/n$, where $n$ is the sample size. But if the population size is $N$, it becomes $[p(1-p)/n]\,(1-n/N)$. The finite population correction factor, $(1-n/N)$, drives the variance to $0$ as $n$ approaches $N$, rather than to the $p(1-p)/N$ you would get at $n=N$ for a single proportion under the infinite-population assumption. For your two-sample problem the formula is a little more complicated, but the idea is the same.
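To make the correction concrete, here is a small sketch in R; the values $p = 0.02$, $n = 15{,}000$, and $N = 20{,}000$ are assumed for illustration, and the sample-size adjustment at the end is the standard single-proportion version, not the more complicated two-sample formula.

```r
## Variance of a single estimated proportion with and without the
## finite population correction (assumed illustrative values).
p <- 0.02      # assumed response rate
n <- 15000     # sample size
N <- 20000     # population size

var_inf <- p * (1 - p) / n              # infinite-population variance
var_fin <- var_inf * (1 - n / N)        # with the (1 - n/N) correction

c(infinite = var_inf, finite = var_fin)

## Folding the correction into a sample-size calculation: if the
## infinite-population analysis asks for n0 units, the finite-population
## requirement for a single proportion is roughly n0 / (1 + (n0 - 1)/N).
n0    <- 15000
n_adj <- n0 / (1 + (n0 - 1) / N)
n_adj   # about 8,600 instead of 15,000
```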
