Solved – How to create a synthetic data set of your population or universe from a survey sample

Tags: census, modeling, simulation, survey

I am thinking about how best to construct a data set containing one record for every individual in the entire United States population, starting from something like the American Community Survey or the United States decennial census public use microdata files.

Both of these "starting points" would be very large; they already cover between 1 and 5% of the entire United States population.

So long as this concept is not majorly flawed from the start, having this synthetic but complete data set would make it much easier to merge on (cold-deck impute) information from other, smaller data sets.

One could obviously take a simplistic approach and replicate every record as many times as its weight indicates. That wouldn't be much different from just analyzing the data set with Stata's frequency weights, but it will obviously create problems in smaller geographic areas.
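For concreteness, here is a minimal sketch of that simplistic expansion in Python with pandas, assuming a person-level PUMS extract with the standard PWGTP person weight; the file name is hypothetical:

```python
import pandas as pd

# Hypothetical person-level PUMS extract; PWGTP is the ACS person weight.
pums = pd.read_csv("acs_pums_person.csv")

# Round each weight to a whole number of copies.
pums["n_copies"] = pums["PWGTP"].round().astype(int)

# Repeat each row n_copies times to build the "expanded" population file.
expanded = pums.loc[pums.index.repeat(pums["n_copies"])].reset_index(drop=True)

# The expanded file should be roughly the size of the population the weights imply.
print(len(expanded))
```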

Say you've got a small county that you know for a fact has 10,000 residents, but only ten records in your sample. Obviously you cannot simply expand those ten records into the 10,000 residents on their own, since you'll end up with a very lumpy (and wrong) age, race, and income distribution within the county. However, if you took a bit of information from nearby areas and projected the probabilities of age, race, income, etc. onto each of the 10,000 records you're creating from scratch, you could semi-randomly create 10,000 records that would look more reasonable for that specific county.
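As a rough illustration of that idea (not a full method), one could pool the county's few PUMS records with those of its neighbors and then draw the 10,000 synthetic residents with probability proportional to the survey weights; the column names, FIPS codes, and neighbor list below are all made up:

```python
import pandas as pd

pums = pd.read_csv("acs_pums_person.csv")       # hypothetical input file
target_county = "56011"                         # hypothetical small county
neighbors = ["56011", "56045", "56019"]         # the county plus nearby counties

# Pool the county's few records with those of its neighbors to borrow strength.
donor_pool = pums[pums["county_fips"].isin(neighbors)]

# Draw the county's known 10,000 residents with probability proportional to weight.
synthetic_county = (
    donor_pool.sample(n=10_000, replace=True,
                      weights=donor_pool["PWGTP"], random_state=0)
    .assign(county_fips=target_county)
    .reset_index(drop=True)
)
```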

I am unsure whether this sort of thing has not been done because it's a terrible idea for some reason, or because statisticians and demographers simply haven't had the computing power to quickly deal with 300 million records until very recently.

Best Answer

Two projects come pretty close to what you want.

First, the Synthetic Population Viewer from the research institute RTI uses 2007-2011 ACS data and builds "synthetic" households so that they sum up to the 2010 census tract estimates.

You can find a methods explanation here:

Wheaton, W.D., J.C. Cajka, B.M. Chasteen, D.K. Wagener, P.C. Cooley, L. Ganapathi, D.J. Roberts, and J.L. Allpress. 2009. "Synthesized Population Databases: A US Geospatial Database for Agent-Based Models." RTI Press.
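In spirit, that kind of household synthesis amounts to reweighting (raking) PUMS households so that weighted counts match known tract margins, and then drawing whole households using the fitted weights. The toy sketch below shows the raking step with made-up margin categories and totals; it illustrates the general idea only and is not RTI's actual procedure or data:

```python
import numpy as np
import pandas as pd

# Tiny made-up household file with a survey weight.
hh = pd.DataFrame({
    "hh_size_cat": ["1", "2", "3+", "1", "2", "3+", "1", "2"],
    "income_cat":  ["low", "low", "low", "mid", "mid", "mid", "high", "high"],
    "wgt":         [10.0, 8.0, 6.0, 12.0, 9.0, 7.0, 5.0, 4.0],
})

# Known (hypothetical) tract totals for each margin.
size_targets = {"1": 30.0, "2": 45.0, "3+": 25.0}
income_targets = {"low": 40.0, "mid": 40.0, "high": 20.0}

w = hh["wgt"].to_numpy(dtype=float)
for _ in range(50):  # a few raking passes are usually enough to converge
    for col, targets in (("hh_size_cat", size_targets),
                         ("income_cat", income_targets)):
        for cat, target in targets.items():
            mask = (hh[col] == cat).to_numpy()
            current = w[mask].sum()
            if current > 0:
                w[mask] *= target / current  # scale this category to its target

hh["fitted_wgt"] = w

# Draw whole households for the tract in proportion to the fitted weights.
tract_hh = hh.sample(n=100, replace=True, weights=hh["fitted_wgt"], random_state=0)
```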

Second, as Andy W mentioned, this is similar to dasymetric mapping where ancillary information is combined with survey data to come up with small-area estimates. A good example of this method is the work by Nagle and colleagues:

Nagle, N.N., et al. 2014. "Dasymetric Modeling and Uncertainty." Annals of the Association of American Geographers 104(1): 80-95.

Leyk, S., B.P. Buttenfield, and N.N. Nagle. 2013. "Modeling Ambiguity in Census Microdata Allocations to Improve Demographic Small Area Estimates." Transactions in Geographic Information Science.
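For intuition, the simplest dasymetric idea is to allocate a known total across small zones in proportion to an ancillary variable rather than spreading it uniformly. The toy example below uses a hypothetical residential land area per tract; it is far simpler than the models in the papers above:

```python
import pandas as pd

# Hypothetical tracts with an ancillary variable (residential land area).
tracts = pd.DataFrame({
    "tract_id": ["A", "B", "C", "D"],
    "residential_km2": [0.5, 2.0, 1.0, 0.5],
})
county_population = 10_000  # the known county total to be allocated

# Allocate the county total in proportion to the ancillary variable.
share = tracts["residential_km2"] / tracts["residential_km2"].sum()
tracts["allocated_pop"] = (county_population * share).round().astype(int)

print(tracts)
```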

Proper caution is still needed when using the output of either of these two methods, but I think you could use either approach as a baseline for "cold-deck imputation" at the census tract level. Keep in mind, though, that cold-deck imputation should only be used under the heroic assumption that the data are missing completely at random.
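As a rough sketch of what that cold-deck step might look like, the example below fills a variable missing from the synthetic file using donor values from a smaller external survey matched on coarse cells; all variable names and values are invented:

```python
import pandas as pd

# Synthetic baseline file, lacking the variable of interest.
synthetic = pd.DataFrame({
    "age_group": ["18-34", "35-64", "65+", "18-34"],
    "sex":       ["F", "M", "F", "M"],
})

# Smaller external donor survey that does carry the variable (values made up).
donor = pd.DataFrame({
    "age_group": ["18-34", "18-34", "35-64", "65+"],
    "sex":       ["F", "M", "M", "F"],
    "smoker":    [0, 1, 0, 0],
})

# Take one donor value per matching cell (a real application would sample
# within cells rather than taking the first record).
donor_cells = donor.groupby(["age_group", "sex"], as_index=False).first()
imputed = synthetic.merge(donor_cells, on=["age_group", "sex"], how="left")
print(imputed)
```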
