Sample survey: can I weight back to the target population from the survey population?

survey

I am working for an organization that regularly polls its members. In a previous study, the researchers started with a target population frame of 50,000 people. They then eliminated 15,000 on the grounds that they had recently received other surveys, leaving a survey population of 35,000. From this, they drew a stratified sample of 4,500 people. 1,730 completed surveys were returned.

The researchers stratified on the basis of the 35,000 and calculated survey weights on that basis. However, they seem to have adjusted the weights to give results for the 50,000 — the sample weights add up to 50,000. They also did some non-response weighting, based on the observation that proportionately more women than men responded. These weights were based on stratum totals from the sample of 4,500 and the 1,730.

My questions are:

  1. Is it OK to weight back to the original, target population?
  2. If so, what should the weights be?
  3. What happens to the variance estimates?

Assume that we are interested in estimating a population total.

Note that the survey population of 35,000 is not a simple random sample of the 50,000. It is the result of removing several stratified samples from the 50,000, with non-proportional strata.

Best Answer

  1. Is it OK to weight back to the original, target population?

As a general rule, yes, it is okay, and indeed desirable, to weight back to the original target population. Your goal in these problems is usually to estimate unknown population quantities that are aggregated over stratified groups. If the number of people in each population group is known (e.g., known numbers of males and females), then it is generally a good idea to weight the sample estimators so that they account for the known sizes of the population groups. In this particular case, it may be dubious to make inferences beyond the sampling frame of 35,000 people to the broader population of 50,000, but that is a separate issue.

  2. If so, what should the weights be?

  3. What happens to the variance estimates?

It sounds like you have a complex sampling problem, so this is a complex question that would need to be considered in light of a detailed understanding of the sampling scheme and estimation methods. However, to give you an idea of the principles involved, I will give a simpler example of a stratified sampling problem with known sizes for the population groups.

Consider the case where you have a population of size $N = N_M + N_F$ consisting of $N_M$ males and $N_F$ females. Each person has some characteristic quantified by a variable $X_i$ and you want to make inferences about the population mean $\bar{X}_N$. Suppose you sample from this population using stratified random sampling with $n_M$ males and $n_F$ females. You obtain sample means $\bar{X}_M$ and $\bar{X}_F$ for these two groups. In this case your estimator of the population mean would be:

$$\hat{\bar{X}}_N = \frac{N_M}{N_M+N_F} \cdot \bar{X}_M + \frac{N_F}{N_M+N_F} \cdot \bar{X}_F.$$
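As a minimal sketch of this estimator in Python (the function name `stratified_mean` and all the numbers are hypothetical, chosen only for illustration):

```python
def stratified_mean(pop_sizes, sample_means):
    """Weight each stratum's sample mean by its known share of the population."""
    total = sum(pop_sizes.values())
    return sum(pop_sizes[g] / total * sample_means[g] for g in pop_sizes)


# Hypothetical inputs: known population counts and observed stratum sample means.
pop_sizes = {"M": 600, "F": 400}
sample_means = {"M": 10.0, "F": 20.0}

estimate = stratified_mean(pop_sizes, sample_means)
# Estimated population total is the estimated mean times the population size.
estimated_total = estimate * sum(pop_sizes.values())
```

Since the question asks about a population total, note that the last line is equivalent to giving every respondent in group $g$ a weight of $N_g / n_g$ and summing the weighted responses.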

We can examine this estimator under the superpopulation approach, where the finite population is embedded in a larger model with mean and variance parameters. Under this approach it can be shown that:

$$\begin{aligned} \mathbb{E}(\hat{\bar{X}}_N - \bar{X}_N) &= 0 \\[10pt] \mathbb{V}(\hat{\bar{X}}_N - \bar{X}_N) &= \frac{1}{(N_M+N_F)^2} \Bigg[ \frac{N_M (N_M - n_M)}{n_M} \cdot \sigma_M^2 + \frac{N_F (N_F - n_F)}{n_F} \cdot \sigma_F^2 \Bigg]. \end{aligned}$$
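The variance formula above can be sketched in Python as follows. The function name `stratified_mean_variance` and the inputs are assumptions made for illustration; in practice the stratum variances $\sigma_M^2$ and $\sigma_F^2$ are unknown and would be replaced by estimates.

```python
def stratified_mean_variance(pop_sizes, sample_sizes, variances):
    """Variance of (estimator - population mean) for a stratified sample.

    Each stratum contributes N_g * (N_g - n_g) / n_g * sigma_g^2,
    and the sum is divided by the squared population size.
    """
    total = sum(pop_sizes.values())
    acc = 0.0
    for g in pop_sizes:
        N_g, n_g = pop_sizes[g], sample_sizes[g]
        acc += N_g * (N_g - n_g) / n_g * variances[g]
    return acc / total**2


# Hypothetical inputs, not taken from the survey in the question.
pop_sizes = {"M": 600, "F": 400}
sample_sizes = {"M": 60, "F": 40}
variances = {"M": 4.0, "F": 9.0}  # assumed stratum variances

var_of_error = stratified_mean_variance(pop_sizes, sample_sizes, variances)
```

The factor $(N_g - n_g)$ acts as a finite-population correction: if a stratum is sampled completely, it contributes nothing to the variance.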

This gives you the quasi-pivotal quantity:

$$T = \frac{(N_M+N_F) \cdot (\hat{\bar{X}}_N - \bar{X}_N)}{\sqrt{N_M (N_M - n_M) S_M^2 / n_M + N_F (N_F - n_F) S_F^2 / n_F}} \overset{\text{Approx}}{\sim} \text{T-Dist}(DF),$$

where $S_M^2$ and $S_F^2$ are the sample variances within each stratum and the degrees of freedom $DF$ are found using the Welch-Satterthwaite method. As you can see, the variance of the difference $\hat{\bar{X}}_N - \bar{X}_N$ is affected by the weighting in the estimator. Given prior values for $\sigma_M^2$ and $\sigma_F^2$, you can minimise this variance subject to the constraint $n = n_M+n_F$ to find the optimal sample sizes for the strata.
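To make these last two steps concrete, here is a rough sketch in Python. The function names and numbers are hypothetical. The first function applies the usual Welch-Satterthwaite approximation to the variance components in the denominator of $T$. The second uses the standard result (often called Neyman allocation) that the constrained minimisation gives $n_g \propto N_g \sigma_g$; it ignores rounding to whole numbers and the requirement $n_g \leqslant N_g$.

```python
def welch_satterthwaite_df(pop_sizes, sample_sizes, sample_vars):
    """Approximate degrees of freedom for the estimated variance."""
    # Variance components N_g * (N_g - n_g) * S_g^2 / n_g, one per stratum.
    terms = {
        g: pop_sizes[g] * (pop_sizes[g] - sample_sizes[g]) / sample_sizes[g] * sample_vars[g]
        for g in pop_sizes
    }
    numerator = sum(terms.values()) ** 2
    denominator = sum(terms[g] ** 2 / (sample_sizes[g] - 1) for g in pop_sizes)
    return numerator / denominator


def neyman_allocation(pop_sizes, sigmas, n_total):
    """Split n_total across strata in proportion to N_g * sigma_g (unrounded)."""
    scores = {g: pop_sizes[g] * sigmas[g] for g in pop_sizes}
    score_sum = sum(scores.values())
    return {g: n_total * scores[g] / score_sum for g in pop_sizes}


# Hypothetical inputs, chosen only for illustration.
pop_sizes = {"M": 600, "F": 400}
sample_sizes = {"M": 60, "F": 40}
sample_vars = {"M": 4.0, "F": 9.0}   # assumed sample variances S_g^2
sigmas = {"M": 2.0, "F": 3.0}        # assumed prior standard deviations

df = welch_satterthwaite_df(pop_sizes, sample_sizes, sample_vars)
allocation = neyman_allocation(pop_sizes, sigmas, n_total=100)
```

In this made-up example the smaller but more variable stratum receives as many sampled units as the larger one, because the allocation depends on $N_g \sigma_g$ rather than on $N_g$ alone.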
