Solved – How to create a synthetic data set of your population or universe from a survey sample

Tags: census, modeling, simulation, survey

I am thinking about how best to construct a data set containing one record for every individual in the entire United States population, starting from something like the American Community Survey or the United States decennial census public use microdata files.

Both of these "starting points" would be very large; they already cover between 1 and 5% of the entire United States population.

So long as this concept is not majorly flawed from the start, having this synthetic but complete data set would make it much easier to merge on (cold-deck impute) information from other, smaller data sets.

One could obviously take a simplistic approach and replicate every record as many times as its weight indicates. That wouldn't be much different from just analyzing the data set with Stata's frequency weights, but it will obviously create problems in smaller geographic areas.
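For concreteness, here is a minimal sketch of that simplistic expansion in Python with pandas, assuming a person-level PUMS extract with the standard PWGTP person weight; the file name is hypothetical:

```python
import pandas as pd

# Hypothetical person-level PUMS extract; PWGTP is the ACS person weight.
pums = pd.read_csv("acs_pums_person.csv")

# Round each weight to a whole number of copies.
pums["n_copies"] = pums["PWGTP"].round().astype(int)

# Repeat each row n_copies times to build the "expanded" population file.
expanded = pums.loc[pums.index.repeat(pums["n_copies"])].reset_index(drop=True)

# The expanded file should be roughly the size of the population the weights imply.
print(len(expanded))
```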

Say you've got a small county that you know for a fact has 10,000 residents, but only ten records in your sample. Obviously you cannot simply expand those ten records into the 10,000 residents on their own, since you'll end up with a very lumpy (and wrong) age, race, and income distribution within the county. However, if you took a bit of information from nearby areas and projected the probabilities of age, race, income, etc. onto each of the 10,000 records you're creating from scratch, you could semi-randomly create 10,000 records that would look more reasonable for that specific county.
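As a rough illustration of that idea (not a full method), one could pool the county's few PUMS records with those of its neighbors and then draw the 10,000 synthetic residents with probability proportional to the survey weights; the column names, FIPS codes, and neighbor list below are all made up:

```python
import pandas as pd

pums = pd.read_csv("acs_pums_person.csv")       # hypothetical input file
target_county = "56011"                         # hypothetical small county
neighbors = ["56011", "56045", "56019"]         # the county plus nearby counties

# Pool the county's few records with those of its neighbors to borrow strength.
donor_pool = pums[pums["county_fips"].isin(neighbors)]

# Draw the county's known 10,000 residents with probability proportional to weight.
synthetic_county = (
    donor_pool.sample(n=10_000, replace=True,
                      weights=donor_pool["PWGTP"], random_state=0)
    .assign(county_fips=target_county)
    .reset_index(drop=True)
)
```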

I am unsure whether this sort of thing has not been done because it's a terrible idea for some reason, or because statisticians and demographers simply haven't had the computing power to quickly deal with 300 million records until very recently.

Best Answer

Two projects come pretty close to what you want.

First, the Synthetic Population Viewer from the research institute RTI uses 2007-2011 ACS data and builds "synthetic" households so that they sum up to the 2010 census tract estimates.

You can find a methods explanation here:

Wheaton, W.D., J.C. Cajka, B.M. Chasteen, D.K. Wagener, P.C. Cooley, L. Ganapathi, D.J. Roberts, and J.L. Allpress. 2009. "Synthesized Population Databases: A US Geospatial Database for Agent-Based Models." RTI Press.
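In spirit, that kind of household synthesis amounts to reweighting (raking) PUMS households so that weighted counts match known tract margins, and then drawing whole households using the fitted weights. The toy sketch below shows the raking step with made-up margin categories and totals; it illustrates the general idea only and is not RTI's actual procedure or data:

```python
import numpy as np
import pandas as pd

# Tiny made-up household file with a survey weight.
hh = pd.DataFrame({
    "hh_size_cat": ["1", "2", "3+", "1", "2", "3+", "1", "2"],
    "income_cat":  ["low", "low", "low", "mid", "mid", "mid", "high", "high"],
    "wgt":         [10.0, 8.0, 6.0, 12.0, 9.0, 7.0, 5.0, 4.0],
})

# Known (hypothetical) tract totals for each margin.
size_targets = {"1": 30.0, "2": 45.0, "3+": 25.0}
income_targets = {"low": 40.0, "mid": 40.0, "high": 20.0}

w = hh["wgt"].to_numpy(dtype=float)
for _ in range(50):  # a few raking passes are usually enough to converge
    for col, targets in (("hh_size_cat", size_targets),
                         ("income_cat", income_targets)):
        for cat, target in targets.items():
            mask = (hh[col] == cat).to_numpy()
            current = w[mask].sum()
            if current > 0:
                w[mask] *= target / current  # scale this category to its target

hh["fitted_wgt"] = w

# Draw whole households for the tract in proportion to the fitted weights.
tract_hh = hh.sample(n=100, replace=True, weights=hh["fitted_wgt"], random_state=0)
```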

Second, as Andy W mentioned, this is similar to dasymetric mapping where ancillary information is combined with survey data to come up with small-area estimates. A good example of this method is the work by Nagle and colleagues:

Nagle, N.N., et al. 2014. "Dasymetric Modeling and Uncertainty." Annals of the Association of American Geographers 104(1): 80-95.

Leyk, S., B.P. Buttenfield, and N.N. Nagle. 2013. "Modeling Ambiguity in Census Microdata Allocations to Improve Demographic Small Area Estimates." Transactions in Geographic Information Science.
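For intuition, the simplest dasymetric idea is to allocate a known total across small zones in proportion to an ancillary variable rather than spreading it uniformly. The toy example below uses a hypothetical residential land area per tract; it is far simpler than the models in the papers above:

```python
import pandas as pd

# Hypothetical tracts with an ancillary variable (residential land area).
tracts = pd.DataFrame({
    "tract_id": ["A", "B", "C", "D"],
    "residential_km2": [0.5, 2.0, 1.0, 0.5],
})
county_population = 10_000  # the known county total to be allocated

# Allocate the county total in proportion to the ancillary variable.
share = tracts["residential_km2"] / tracts["residential_km2"].sum()
tracts["allocated_pop"] = (county_population * share).round().astype(int)

print(tracts)
```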

Proper caution is still needed when using the output of either of these two methods, but I think you could use either approach as a baseline for "cold-deck imputation" at the census tract level. Keep in mind, though, that cold-deck imputation should only be used under the heroic assumption that the data are missing completely at random.
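As a rough sketch of what that cold-deck step might look like, the example below fills a variable missing from the synthetic file using donor values from a smaller external survey matched on coarse cells; all variable names and values are invented:

```python
import pandas as pd

# Synthetic baseline file, lacking the variable of interest.
synthetic = pd.DataFrame({
    "age_group": ["18-34", "35-64", "65+", "18-34"],
    "sex":       ["F", "M", "F", "M"],
})

# Smaller external donor survey that does carry the variable (values made up).
donor = pd.DataFrame({
    "age_group": ["18-34", "18-34", "35-64", "65+"],
    "sex":       ["F", "M", "M", "F"],
    "smoker":    [0, 1, 0, 0],
})

# Take one donor value per matching cell (a real application would sample
# within cells rather than taking the first record).
donor_cells = donor.groupby(["age_group", "sex"], as_index=False).first()
imputed = synthetic.merge(donor_cells, on=["age_group", "sex"], how="left")
print(imputed)
```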
