Solved – Pooling data from two different samples: Does the scale of the sampling weights matter?

pooling, survey, survey-weights

Background

A colleague recently came to me with a problem. He was tasked with comparing health service utilization indicators in two secondary datasets:

  1. The first dataset is a Demographic and Health Survey (DHS) dataset,
    which has around 9,000 observations and is a nationally
    representative sample of women 15-49 in Country X. Individuals were
    sampled using stratified two-stage cluster sampling. The strata for
    this sample were the province of residence and whether the cluster
    was in an urban area or rural area.
  2. The second dataset comes from a survey whose instrument was modeled after the DHS and which used the same sampling strategy as the DHS Program. It has around 2,000 observations and is a representative sample of women 15-49 in 3 of the 11 provinces in Country X. As with the DHS, the strata were the province of residence and whether the cluster was in an urban or rural area.

Both datasets contain sampling weights, but the scales are different. For the first dataset, the party that processes the data generally multiplies the weight by 100,000 to preserve decimal places. The documentation urges users to divide the weight variable that is shipped with the dataset by 100,000 before using. For the second dataset, the party that processes the data did not transform the weight variable further, so that the variable could be used as-is.

Problem

The "scale" of the weight, whether divided by 100,000 first or used as-is, doesn't really matter for point estimates of proportions, means, or parameters within a single survey, because multiplying every weight by the same constant leaves the relative weights unchanged and only affects the "effective" number of observations (e.g. 1,000/1,000,000 is the same ratio as 0.01/10). What I am not sure about is whether the weights need to be re-scaled when the data are pooled.

The DHS sampling manual states that when pooling DHS datasets, the weights need to be "de-normalized" before use (ICF International, 2012, p. 28): for each survey, multiply the weight by the target population and divide by the number of completed cases (in other words, the sum of the weights), because the shipped sampling weights are country- and time-specific. My inclination is that once the weights are de-normalized, it is not necessary to ensure that they are on the same scale, as he is only interested in the differences in proportions between the two datasets. Is this correct, or will having weight variables on different scales be a problem when doing regression?
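To make the distinction concrete, here is a minimal Python sketch with made-up numbers. The data, the variable names, and the `weighted_mean` helper are all hypothetical and only illustrate the arithmetic: within one survey a constant factor cancels, but in a pooled estimate the relative scale of the two weight variables decides how much each survey contributes.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)

# Hypothetical toy data: a 0/1 indicator and one weight per respondent.
y_a = [1, 0, 1, 1]
w_a = [150_000.0, 50_000.0, 100_000.0, 100_000.0]  # shipped on the inflated scale
y_b = [0, 0, 1, 0]
w_b = [1.2, 0.8, 1.0, 1.0]                          # usable as-is

w_a_div = [w / 100_000 for w in w_a]

# Within one survey the constant cancels, so the estimate does not change.
within_raw = weighted_mean(y_a, w_a)
within_div = weighted_mean(y_a, w_a_div)

# In a pooled estimate the relative scale sets each survey's share.
pooled_raw = weighted_mean(y_a + y_b, w_a + w_b)
pooled_div = weighted_mean(y_a + y_b, w_a_div + w_b)

print(within_raw, within_div)  # same value up to floating-point rounding
print(pooled_raw, pooled_div)  # noticeably different values
```

With the unscaled weights, the first survey swamps the second in the pooled estimate. Estimates computed separately for each survey, and therefore their difference, are unaffected by a survey-specific constant.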

Reference

  1. ICF International. 2012. Demographic and Health Survey Sampling and
    Household Listing Manual. Calverton, Maryland, USA: ICF
    International.
    http://dhsprogram.com/pubs/pdf/DHSM4/DHS6_Sampling_Manual_Sept2012_DHSM4.pdf

Best Answer

What DHS does with weights is beyond me. I think their intent with the division by 100,000 is to make the weights sum up to the nominal sample size of 9,000. But this is an awkward scale for weights. The proper scale should be the population of the country (or rather, since DHS surveys the specific population of women of reproductive age, the total number of women aged 15-49). ICF computes the weights properly, stringing together the probabilities of selection, but then chooses to destroy the scale. Ah well. So as your first step, you would need to scale the weights in the DHS sample up so that they sum up to the population total. (My guess, though, is that since that total is usually not known very accurately, DHS sweeps the issue under the carpet and just keeps a poker face with weights that sum up to the sample size.)

Likewise, your additional sample should be scaled so that the weights sum up to the target population in your three provinces.
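As a rough sketch of this rescaling step in Python: the helper name `scale_to_total`, the example weights, and both population totals below are made up for illustration. In practice the totals would have to come from a census or population projection.

```python
def scale_to_total(weights, population_total):
    """Rescale weights proportionally so that they sum to population_total."""
    total_weight = sum(weights)
    return [w * population_total / total_weight for w in weights]

# Hypothetical numbers, for illustration only.
dhs_w = [0.8, 1.1, 1.3, 0.8]    # normalized weights (sum equals the sample size)
extra_w = [40.0, 55.0, 65.0]    # second survey, on some other scale

# Assumed totals: women 15-49 nationally and in the three provinces.
dhs_scaled = scale_to_total(dhs_w, population_total=5_000_000)
extra_scaled = scale_to_total(extra_w, population_total=1_200_000)

print(sum(dhs_scaled), sum(extra_scaled))
```

The rescaling keeps the ratios between weights within each survey intact; only the overall level changes, which is what puts the two surveys on a comparable footing.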

Once that is done, you can combine the weights using a version of the single frame estimation method (Lohr 2009). Since weights are inverse probabilities of selection, the combined weight should be the inverse of the combined probability of selection:

$$ w_i^c = 1/\pi_i^c = 1/[1-(1-\pi_i^1)(1-\pi_i^2)] \approx 1/(\pi_i^1 + \pi_i^2) = 1/(1/w_i^1 + 1/w_i^2) $$

for the observations in the three provinces that were sampled twice, while the observations in the remaining provinces just retain their DHS weight.
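A minimal Python sketch of that formula follows. It assumes that, for every observation in the three overlapping provinces, both population-scaled weights are available: the weight under the DHS design and the weight under the second survey's design. The function name `combined_weight` and the example weights are hypothetical.

```python
def combined_weight(w1, w2, exact=False):
    """Combined weight for a unit that could be selected by either survey.

    w1, w2: population-scaled weights (inverse selection probabilities)
    of the same unit under survey 1 and survey 2.
    """
    p1, p2 = 1.0 / w1, 1.0 / w2
    if exact:
        # Probability of being selected by at least one of the two surveys.
        return 1.0 / (1.0 - (1.0 - p1) * (1.0 - p2))
    # Approximation that drops the small cross term p1 * p2.
    return 1.0 / (p1 + p2)

# Hypothetical pairs of weights for two units in an overlapping province.
pairs = [(1000.0, 250.0), (2000.0, 500.0)]

w_approx = [combined_weight(a, b) for a, b in pairs]
w_exact = [combined_weight(a, b, exact=True) for a, b in pairs]

print(w_approx)
print(w_exact)
```

When selection probabilities are small, as they typically are in national surveys, the exact and approximate versions differ only slightly; the approximation is always a little smaller because it ignores the chance of being selected by both surveys.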
