Combining Survey Data – How to Combine Data from 5 Surveys Spanning 10 Years

meta-analysis, population, sampling, survey, weighted-sampling

I have results from 5 surveys, each 2 years apart; let us assume that no subject is selected in more than one survey.

The sampling methods used in these surveys are biased, and I have sampling weights calculated (with respect to the population) for each data point in each survey.

The question is, how would I be able to combine the 5 datasets and have the weights recalculated so as to obtain one giant dataset for analysis on this population?

Also, what should I do if subjects appear in more than one survey?

Updates/Further Elaboration:

Thank you @user30523, here is some more information that might be useful:

Suppose I wish to find out the estimated distribution of height across the population using these 5 datasets.

In some datasets, younger people are oversampled because of the location where the survey was conducted. Let's assume the weights are calculated with respect to age.

E.g., assuming 2% of the population is 15 years old, and the survey takes place at a mall where 15-year-olds make up 5% of all shoppers, then the sampling weight for a subject aged 15 in that survey would be calculated as 0.02 / 0.05 = 0.4. For simplicity, each person in the mall has an equal chance of being surveyed and all participants comply when asked.
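To make the calculation concrete, here is a minimal Python sketch of the weight described above (population share divided by sample share). The two-group split into "15" and "other" and the function name `poststratification_weights` are only illustrative assumptions; the 2% and 5% figures are the ones from the example.

```python
def poststratification_weights(population_share, sample_share):
    """Weight per group = share in the population / share in the sample."""
    return {g: population_share[g] / sample_share[g] for g in sample_share}

# Shares from the example; lumping everyone else into "other" is an assumption.
population_share = {"15": 0.02, "other": 0.98}
mall_share = {"15": 0.05, "other": 0.95}

weights = poststratification_weights(population_share, mall_share)

# After weighting, the sample's group shares should match the population's.
weighted_share = {g: mall_share[g] * weights[g] for g in mall_share}
```

Here `weights["15"]` comes out at about 0.4, and the weighted shares add up to 1, which is a quick sanity check that the weights restore the population proportions.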

Given that the 5 surveys are conducted in 5 different malls and each has its own set of weights calculated in the same way, how would I then be able to combine all 5 datasets and recalculate the sampling weights?

P.S.: I'm new to the topic of sampling weights, so do correct me if I have made errors in the way I have calculated the weights.

Best Answer

I think if each dataset is already weighted to your satisfaction, then you have a couple of different options. Which one is the right one may vary based on your objectives and the particulars of your existing data collection and weighting.

  • (#1) Union all of the datasets, along with their pre-calculated weights, and that's it.

This would be the right choice if each dataset was weighted towards a proper total count and didn't over-state the importance of any individual record relative to another dataset. If one dataset was weighted to reflect Total US Population, and another dataset was weighted in place to its own total count of respondents, then this would not be the right choice.

  • (#2) Calculate a weight for each dataset to multiply by each record's existing weight

This would be the right choice if each of your datasets is of equal importance regardless of its size. Example below...

  • (#3) Union all of the raw data and re-calculate the weights on the new, entire dataset

This would be the right choice if the reasons for non-response are similar across your different surveys - it results in the simplest data for you to work with, and it's the least likely to produce extreme weights.
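As a rough sketch of option #3, assuming the weights depend only on an age group as in the question, the raw records can be pooled and the weights recomputed against the pooled sample shares. The two tiny surveys and the function name `pooled_weights` are made up for illustration; only the 2% population share comes from the question.

```python
from collections import Counter

def pooled_weights(age_groups, population_share):
    """Recompute one weight per record from the pooled sample's group shares."""
    counts = Counter(age_groups)
    n = len(age_groups)
    return [population_share[g] / (counts[g] / n) for g in age_groups]

# Hypothetical raw data: one age-group label per respondent.
survey_a = ["15", "15", "other", "other"]
survey_b = ["15", "other", "other", "other", "other", "other"]
pooled = survey_a + survey_b  # union of the raw records, old weights dropped

population_share = {"15": 0.02, "other": 0.98}
weights = pooled_weights(pooled, population_share)

# Weighted share of 15-year-olds in the pooled data.
share_15 = sum(w for w, g in zip(weights, pooled) if g == "15") / sum(weights)
```

With weights defined this way they sum to the pooled sample size, and the weighted share of 15-year-olds equals the population share of 2%.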

Example for #2: each dataset is weighted to equal importance, with this "dataset weight" being multiplied by whatever weight has already been calculated within the dataset.

> Survey 1: 100 people   weight:  2
> Survey 2: 200 people   weight:  1
> Survey 3: 300 people   weight:  2/3
> Survey 4: 150 people   weight:  4/3
> Survey 5: 250 people   weight:  4/5
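One way to arrive at these numbers, sketched in Python: scale every survey to the average survey size (1000 / 5 = 200 here), so the dataset weight is the average size divided by the survey's own size. That rule is my reading of the example rather than the only possible convention, and the function name `dataset_weights` is just illustrative.

```python
from fractions import Fraction

def dataset_weights(sizes):
    """Scale factor per survey so each one counts as much as an average-sized survey."""
    target = Fraction(sum(sizes), len(sizes))  # average survey size
    return [target / n for n in sizes]

sizes = [100, 200, 300, 150, 250]  # respondents in Surveys 1 to 5
factors = dataset_weights(sizes)

# Effective size of each survey after scaling; all should be equal.
effective_sizes = [n * f for n, f in zip(sizes, factors)]

# Final weight of a record = its existing within-survey weight * dataset weight,
# e.g. a record in Survey 1 with an existing weight of 0.4:
combined = 0.4 * float(factors[0])
```

The factors come out as 2, 1, 2/3, 4/3 and 4/5, so every survey contributes an effective 200 respondents to the combined dataset.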