Solved – Sample survey: can I weight back to the target population from the survey population

survey

I am working for an organization that regularly polls its members. In a previous study, the researchers started with a target population frame of 50,000 people. They then eliminated 15,000 on the grounds that they had recently received other surveys, leaving a survey population of 35,000. From this, they drew a stratified sample of 4,500 people. 1,730 completed surveys were returned.

The researchers stratified on the basis of the 35,000 and calculated survey weights on that basis. However, they seem to have adjusted the weights to give results for the 50,000 — the sample weights add up to 50,000. They also did some non-response weighting, based on the observation that proportionately more women than men responded. These weights were based on stratum totals from the sample of 4,500 and the 1,730.

My questions are:

Is it OK to weight back to the original, target population?
If so, what should the weights be?
What happens to the variance estimates?

Assume that we are interested in estimating a population total.

Note that the survey population of 35,000 is not a simple random sample of the 50,000. It is the result of removing several stratified samples from the 50,000, with non-proportional strata.

Best Answer

Is it OK to weight back to the original, target population?

As a general rule, yes: it is okay, and indeed desirable, to weight back to the original target population. Your goal in these problems is usually to estimate an unknown population quantity that is aggregated over stratified groups. If the number of people in each group in the population is known (e.g., known numbers of males and females), then it is generally a good idea to weight the sample estimators so that they account for the known sizes of the population groups. In this particular case, it may be dubious to extend inference beyond the sampling frame of 35,000 people to the broader population of 50,000, but that is a separate issue.

If so, what should the weights be?

What happens to the variance estimates?

It sounds like you have a complex sampling problem, so this is a complex question that would need to be considered in light of a detailed understanding of the sampling scheme and estimation methods. However, to give you an idea of the principles involved, I will give a simpler example of a stratified sampling problem with known sizes for the population groups.

Consider the case where you have a population of size $N = N_M + N_F$ consisting of $N_M$ males and $N_F$ females. Each person has some characteristic quantified by a variable $X_i$ and you want to make inferences about the population mean $\bar{X}_N$. Suppose you sample from this population using stratified random sampling with $n_M$ males and $n_F$ females. You obtain sample means $\bar{X}_M$ and $\bar{X}_F$ for these two groups. In this case your estimator of the population mean would be:

$$\hat{\bar{X}}_N = \frac{N_M}{N_M+N_F} \cdot \bar{X}_M + \frac{N_F}{N_M+N_F} \cdot \bar{X}_F.$$
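As a minimal sketch of the estimator above, the following Python snippet weights each stratum's sample mean by its known population share. The function name `stratified_mean` and all of the numbers are hypothetical, chosen only for illustration.

```python
def stratified_mean(pop_sizes, sample_means):
    """Estimate the population mean as sum_h (N_h / N) * xbar_h.

    pop_sizes    -- dict mapping stratum label to known population size N_h
    sample_means -- dict mapping stratum label to sample mean xbar_h
    """
    total = sum(pop_sizes.values())
    return sum(pop_sizes[h] / total * sample_means[h] for h in pop_sizes)


# Hypothetical figures, not taken from the survey in the question.
pop_sizes = {"M": 30_000, "F": 20_000}
sample_means = {"M": 4.0, "F": 5.0}

mean_estimate = stratified_mean(pop_sizes, sample_means)
# An estimate of the population total is N times the estimated mean.
total_estimate = sum(pop_sizes.values()) * mean_estimate
print(mean_estimate, total_estimate)
```

Since the question asks about a population total, note that the total estimate is just the mean estimate multiplied by $N_M + N_F$, so the same weights apply.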

We can examine this estimator under the superpopulation approach, where the finite population is embedded in a larger model with mean and variance parameters. Under this approach it can be shown that:

$$\begin{aligned} \mathbb{E}(\hat{\bar{X}}_N - \bar{X}_N) &= 0, \\[10pt] \mathbb{V}(\hat{\bar{X}}_N - \bar{X}_N) &= \frac{1}{(N_M+N_F)^2} \Bigg[ \frac{N_M (N_M - n_M)}{n_M} \cdot \sigma_M^2 + \frac{N_F (N_F - n_F)}{n_F} \cdot \sigma_F^2 \Bigg]. \end{aligned}$$
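The variance expression can be evaluated directly once the stratum sizes and variances are given. The sketch below assumes the variance parameters are known, and the function name `prediction_variance` is hypothetical.

```python
def prediction_variance(pop_sizes, sample_sizes, variances):
    """Variance of (estimated mean - population mean) under the formula above.

    Each stratum h contributes N_h * (N_h - n_h) / n_h * sigma_h^2,
    and the sum is divided by N^2, where N is the total population size.
    """
    total = sum(pop_sizes.values())
    acc = 0.0
    for h, big_n in pop_sizes.items():
        small_n = sample_sizes[h]
        acc += big_n * (big_n - small_n) / small_n * variances[h]
    return acc / total ** 2


# Hypothetical inputs for illustration.
v = prediction_variance(
    pop_sizes={"M": 30_000, "F": 20_000},
    sample_sizes={"M": 900, "F": 830},
    variances={"M": 2.5, "F": 1.8},
)
print(v)
```

Note that the variance falls to zero when every stratum is fully enumerated ($n_M = N_M$ and $n_F = N_F$). Replacing the unknown $\sigma_M^2$ and $\sigma_F^2$ with the sample variances $S_M^2$ and $S_F^2$ gives a plug-in estimate of this variance.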

This gives you the quasi-pivotal quantity:

$$T = \frac{(N_M+N_F) \cdot (\hat{\bar{X}}_N - \bar{X}_N)}{\sqrt{N_M (N_M - n_M) S_M^2 / n_M + N_F (N_F - n_F) S_F^2 / n_F}} \overset{\text{Approx}}{\sim} \text{T-Dist}(DF),$$

where the degrees-of-freedom $DF$ are found using the Welch-Satterthwaite method. As you can see, the variance of the difference $\hat{\bar{X}}_N - \bar{X}_N$ is affected by the weighting in the estimator. Given a prior assumption about $\sigma_M^2$ and $\sigma_F^2$, minimisation of this variance subject to the constraint $n = n_M+n_F$ can be used as an optimisation problem to find the optimal sample sizes for the strata.
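To make the last two points concrete, here is a hedged sketch of both calculations. The Welch-Satterthwaite degrees of freedom are computed from the per-stratum terms $N_h (N_h - n_h) S_h^2 / n_h$. Minimising the variance subject to a fixed total sample size leads to allocating $n_h$ in proportion to $N_h \sigma_h$ (commonly called Neyman allocation). The function names are hypothetical, and the allocation ignores integer rounding and the constraint $n_h \leq N_h$.

```python
def welch_satterthwaite_df(pop_sizes, sample_sizes, sample_vars):
    """Approximate degrees of freedom for the studentised estimator.

    With terms t_h = N_h * (N_h - n_h) * S_h^2 / n_h, the approximation is
    DF = (sum_h t_h)^2 / sum_h (t_h^2 / (n_h - 1)).
    """
    terms = {
        h: pop_sizes[h] * (pop_sizes[h] - sample_sizes[h]) * sample_vars[h] / sample_sizes[h]
        for h in pop_sizes
    }
    numerator = sum(terms.values()) ** 2
    denominator = sum(terms[h] ** 2 / (sample_sizes[h] - 1) for h in terms)
    return numerator / denominator


def neyman_allocation(n, pop_sizes, sigmas):
    """Split a total sample size n across strata in proportion to N_h * sigma_h.

    Returns real-valued sample sizes; rounding to integers is left to the caller.
    """
    scale = sum(pop_sizes[h] * sigmas[h] for h in pop_sizes)
    return {h: n * pop_sizes[h] * sigmas[h] / scale for h in pop_sizes}


# Hypothetical inputs for illustration.
df = welch_satterthwaite_df(
    pop_sizes={"M": 1000, "F": 1000},
    sample_sizes={"M": 11, "F": 11},
    sample_vars={"M": 2.0, "F": 2.0},
)
allocation = neyman_allocation(
    n=100,
    pop_sizes={"M": 1000, "F": 1000},
    sigmas={"M": 1.0, "F": 3.0},
)
print(df, allocation)
```

In the allocation example the strata are equally large, so the stratum with three times the standard deviation receives three times the sample.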

Related Solutions

Solved – Does SurveyMonkey ignore the fact that you get a non-random sample

The short answer is yes: Survey Monkey ignores exactly how you obtained your sample. Survey Monkey has no way of knowing whether what you have gathered is a convenience sample, yet virtually every Survey Monkey survey is one. This creates a massive discrepancy in what you are estimating, which no amount of additional sampling will eliminate. On one hand, you could define the population (and the associations therein) that you would obtain from a simple random sample (SRS). On the other, you could define a population by your non-random sampling; the associations in that population are ones you can estimate (and the power rules hold for such values). It's up to you as a researcher to discuss the discrepancy and let the reader decide how valid the non-random sample could be in approximating a real trend.

As a side point, the term bias is used inconsistently. In probability theory, the bias of an estimator is defined by $\mbox{Bias}_n = \mathbb{E}(\hat{\theta}_n) - \theta$. However, an estimator can be biased but consistent, i.e. $\hat{\theta}_n \rightarrow_p \theta$, so that the bias "vanishes" in large samples; an example is the bias of the maximum likelihood estimate of the standard deviation of normally distributed RVs. Estimators whose bias does not vanish (i.e. $\hat{\theta}_n \not\to_p \theta$) are called inconsistent in probability theory. Study design experts (like epidemiologists) have picked up a bad habit of calling inconsistency "bias". In this case, it's selection bias or volunteer bias. It's certainly a form of bias, but inconsistency implies that no amount of sampling will ever correct the issue.

In order to estimate population-level associations from convenience sample data, you would have to correctly identify the sampling probability mechanism and use inverse probability weighting in all of your estimates. Only in very rare situations does this make sense, because identifying such a mechanism is next to impossible in practice. One case where it can be done is a cohort of individuals with previously recorded information who are approached to fill out a survey. Nonresponse probability can then be estimated as a function of that previous information, e.g. age, sex, SES, and so on. Weighting gives you a chance to extrapolate what the results would have been in the non-responder population. The census is a good example of the use of inverse probability weighting for such analyses.
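A minimal sketch of this idea, assuming the response probability depends only on a single known grouping variable (here sex), is the weighting-class adjustment below. Each respondent is weighted by the inverse of the observed response rate in their group. The function name `nonresponse_weights` and the data are hypothetical.

```python
from collections import Counter


def nonresponse_weights(groups, responded):
    """Inverse response-rate weight for each group.

    groups    -- group label for every sampled person
    responded -- True/False for every sampled person, in the same order
    Groups with no respondents are omitted, since no weight can be formed.
    """
    n_sampled = Counter(groups)
    n_responded = Counter(g for g, r in zip(groups, responded) if r)
    return {
        g: n_sampled[g] / n_responded[g]
        for g in n_sampled
        if n_responded[g] > 0
    }


# Hypothetical sample: 4 women (3 respond) and 4 men (2 respond).
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]
responded = [True, True, True, False, True, True, False, False]

weights = nonresponse_weights(groups, responded)
# The weighted respondents reproduce the size of the original sample.
weighted_count = sum(weights[g] for g, r in zip(groups, responded) if r)
print(weights, weighted_count)
```

In practice these adjustment factors would multiply the design weights rather than replace them, and the response model could use more covariates than a single grouping.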

Solved – R survey package: finite population correction affects point estimate in addition to the variance estimate

Yes, they will give different estimates. ?svydesign says "If population sizes are specified but not sampling probabilities or weights, the sampling probabilities will be computed from the population sizes assuming simple random sampling within strata."

looking inside survey:::svydesign.default

if (is.null(probs) && is.null(weights)) {
    if (is.null(fpc$popsize)) {
        if (missing(probs) && missing(weights)) 
            warning("No weights or probabilities supplied, assuming equal probability")
        probs <- rep(1, nrow(ids))
    }
    else {
        probs <- 1/weights(fpc, final = FALSE)
    }
}

So if weights are not specified by the user but the fpc is, then the stratum-level fpc is used to compute the weights, which affects the point estimates as well as the variance calculations.

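library(survey)
data(api)  # loads apistrat, among other data sets

# fpc supplied, no weights: sampling probabilities are derived from the fpc
dstrat1<-svydesign(id=~1,strata=~stype, data=apistrat, fpc=~fpc)
# neither fpc nor weights: equal probabilities are assumed (with a warning)
dstrat2<-svydesign(id=~1,strata=~stype, data=apistrat)

# the two designs give different point estimates
svymean( ~ api00 , dstrat1 )
svymean( ~ api00 , dstrat2 )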
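The arithmetic behind the difference can be sketched outside R. The Python snippet below is not the survey package's code. It is a hypothetical illustration with made-up strata and values, comparing a mean computed with weights $N_h / n_h$ (what population sizes imply under simple random sampling within strata) against a mean computed with equal weights.

```python
from collections import Counter


def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical stratified sample: (stratum label, observed value).
sample = [("E", 600.0), ("E", 700.0), ("H", 500.0), ("H", 550.0)]
pop_sizes = {"E": 4000, "H": 1000}  # assumed known stratum population sizes

values = [x for _, x in sample]
n_per_stratum = Counter(s for s, _ in sample)

# Weights implied by the population sizes: N_h / n_h for each sampled unit.
weights_from_sizes = [pop_sizes[s] / n_per_stratum[s] for s, _ in sample]
# Equal weights, as when neither weights nor population sizes are given.
equal_weights = [1.0] * len(sample)

mean_with_sizes = weighted_mean(values, weights_from_sizes)
mean_equal = weighted_mean(values, equal_weights)
print(mean_with_sizes, mean_equal)
```

Because the strata are sampled at different rates, the two sets of weights give different point estimates, which mirrors the behaviour described above.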
