I am thinking about how best to construct a data set containing one record for every individual in the United States population, starting from something like the American Community Survey or the decennial census public-use microdata files.
Both of these starting points are already very large: the public-use microdata files cover between 1% and 5% of the entire United States population.
So long as this concept is not fundamentally flawed from the start, having this synthetic but complete data set would make it much easier to merge on (cold-deck impute) information from other, smaller data sets.
One could obviously take a simplistic approach and, for every record, create as many copies as that record's weight. This wouldn't be much different from just analyzing the data set with Stata's frequency weights, but it will obviously create problems in smaller geographic areas.
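To make the baseline concrete, here is a minimal sketch of that naive expansion using pandas. The data frame and its column names are hypothetical; the point is that replicating each record by its weight is exactly equivalent to frequency-weighted analysis and adds no new information.

```python
import pandas as pd

# Hypothetical person-level microdata: one row per sampled person,
# with an integer frequency weight (how many people the record represents).
sample = pd.DataFrame({
    "person_id": [1, 2, 3],
    "age":       [34, 61, 8],
    "weight":    [3, 2, 4],
})

# Naive expansion: replicate each record `weight` times. This inherits
# all the lumpiness of the original sample -- three distinct people
# stand in for nine.
expanded = sample.loc[sample.index.repeat(sample["weight"])].reset_index(drop=True)

print(len(expanded))  # 3 + 2 + 4 = 9 synthetic people
```

With real PUMS files the weights are not integers, so you would need to round or randomly round them before repeating, which is one more source of error in small areas.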
Say you've got a small county that you know for a fact has 10,000 residents, but only ten records in your sample. Obviously you cannot simply expand those ten records into 10,000 on their own: you'd end up with a very lumpy (and wrong) age, race, and income distribution within the county. However, if you borrowed a bit of information from nearby areas to estimate the probabilities of age, race, income, etc. for each of the 10,000 records you're creating from scratch, you could semi-randomly generate 10,000 records that look far more reasonable for that specific county.
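One very simple version of this "borrow from the neighbors" idea can be sketched as shrinkage toward a regional distribution. Everything below is an illustrative assumption: the age groups, the counts, and the blend weight `alpha` are made up, and a real implementation would use something principled like raking or iterative proportional fitting against known margins.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: the small county has only 10 sampled records, far
# too few to pin down its age distribution. Pool weighted counts from
# nearby areas (e.g. the surrounding PUMA) to smooth the probabilities.
age_groups      = ["0-17", "18-44", "45-64", "65+"]
county_counts   = np.array([1, 5, 3, 1], dtype=float)      # 10 local records
neighbor_counts = np.array([60, 90, 70, 40], dtype=float)  # nearby areas

# Simple shrinkage: blend the local and regional distributions.
# alpha is an arbitrary illustrative choice, not an estimated quantity.
alpha = 0.3
p = (alpha * county_counts / county_counts.sum()
     + (1 - alpha) * neighbor_counts / neighbor_counts.sum())

# Draw an age group for each of the 10,000 residents the county is known
# to have, giving a smoother distribution than expanding 10 records.
synthetic_ages = rng.choice(age_groups, size=10_000, p=p)
```

In practice you would do this jointly across age, race, income, and the other variables (so the correlations survive), not one margin at a time as in this sketch.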
I am unsure whether this sort of thing hasn't been done because it's a terrible idea for some reason, or because statisticians and demographers simply haven't had the computing power to quickly handle 300 million records until very recently.
Best Answer
Two projects come pretty close to what you wanted.
First, the Synthetic Population Viewer from RTI International uses 2007-2011 ACS data to build synthetic households that sum to the 2010 census tract estimates.
You can find a methods explanation here:
Second, as Andy W mentioned, this is similar to dasymetric mapping, where ancillary information is combined with survey data to produce small-area estimates. A good example of this method is the work by Nagle and colleagues:
Proper caution is still needed when using the output of either of these two methods, but I think you could use either approach as a baseline for "cold-deck imputation" at the census tract level. Keep in mind that cold-deck imputation is only defensible under the heroic assumption that the data are missing completely at random.
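For what cold-deck imputation would look like mechanically, here is a minimal pandas sketch: a variable observed only in a small external survey is attached to the synthetic population by matching on a shared demographic cell. The data frames, column names, and variable are all hypothetical, and the result is only meaningful under the MCAR-within-cells assumption just mentioned.

```python
import pandas as pd

# Hypothetical synthetic population (recipient) and external survey (donor).
synthetic = pd.DataFrame({
    "age_group": ["18-44", "45-64", "18-44"],
    "sex":       ["F", "M", "M"],
})
donor = pd.DataFrame({
    "age_group":    ["18-44", "18-44", "45-64"],
    "sex":          ["F", "M", "M"],
    "has_diabetes": [0, 0, 1],
})

# One donor value per (age_group, sex) cell in this toy example; with
# multiple donors per cell you would sample one at random instead.
cell_values = donor.groupby(["age_group", "sex"], as_index=False).first()

# Cold-deck step: copy the donor value onto every matching synthetic record.
imputed = synthetic.merge(cell_values, on=["age_group", "sex"], how="left")
```

Any cell present in the synthetic population but absent from the donor survey comes back as missing, which is exactly where the small-area problem from the question reappears.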