Solved – Using the sample weights in regression

sampling, weighted-sampling

I have dataset A, in which households are interviewed each quarter for one year. I am focusing only on the households that have data for all quarters. With this restriction, the number of observations is reduced by more than half. Next, I have dataset B, which contains a variable "notinA" that is not in dataset A. I am trying to impute this variable in dataset A using common demographic variables (available in both datasets) and nearest-neighbor matching. In this case, not all of the observations in dataset B are used, and some are used more than once when imputing "notinA" in A from B. Moreover, there are some observations in dataset A for which no match is possible, and these are dropped. So, my question is whether it is justifiable to use the sample weights (pweights) of dataset A for the final data (with the variable imputed from dataset B). Can you please point to literature that suggests how to deal with sample weights in a case like this?

Best Answer

The answer is "No", you can't use the A weights as is.

You've imputed your new variable to a subset of dataset A. Call it A'. Now A' does not represent the original A sample, so you need to apply what is known as "inverse probability weighting". A Google search will turn up many references. Briefly: estimate the probability of successful imputation; call it p_mi. If the study weight is w*, then the final weight will be w*/p_mi. (Technically, you've re-weighted.) The p_mi model can be based on any characteristics known for members of A, not just the demographic variables they share with B. The predictors can also include design characteristics, such as the initial study weights, strata, and stratum characteristics.
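As a rough illustration of the w*/p_mi adjustment, here is a minimal Python sketch. It is not a full propensity model: instead of a regression on many predictors, it estimates p_mi as the weighted share of successfully matched households within cells of a single covariate (a weighting-class adjustment). The field names `region`, `w_star`, and `matched`, the function `ipw_adjust`, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict

def ipw_adjust(records, cell_key, weight_key="w_star", matched_key="matched"):
    """Return the matched records with an added final weight w*/p_mi.

    p_mi is estimated per cell as (weighted matched total) / (weighted total),
    a simple stand-in for a fitted propensity model.
    """
    total = defaultdict(float)
    matched = defaultdict(float)
    for r in records:
        cell = r[cell_key]
        total[cell] += r[weight_key]
        if r[matched_key]:
            matched[cell] += r[weight_key]

    # Estimated probability of successful imputation in each cell.
    p_mi = {cell: matched[cell] / total[cell] for cell in total}

    adjusted = []
    for r in records:
        if r[matched_key]:  # unmatched households drop out of A'
            p = p_mi[r[cell_key]]
            adjusted.append({**r, "p_mi": p, "w_final": r[weight_key] / p})
    return adjusted

# Hypothetical toy data: four households from A, one of which found no match in B.
households = [
    {"id": 1, "region": "urban", "w_star": 100.0, "matched": True},
    {"id": 2, "region": "urban", "w_star": 100.0, "matched": False},
    {"id": 3, "region": "rural", "w_star": 50.0, "matched": True},
    {"id": 4, "region": "rural", "w_star": 150.0, "matched": True},
]
adjusted = ipw_adjust(households, cell_key="region")
```

In this toy case the matched urban household carries the weight of the dropped one, so the final weights in A' add up to the same total as the w* weights in A.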

A couple of other issues:

If you are not using multiple imputation (MI), do so; otherwise, analyses with the new variable will not account for the uncertainty introduced by imputing "notinA". Edit: note that an imputation model is one that predicts "notinA" in the B population.
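To make the point about uncertainty concrete, here is a small sketch of how estimates from several imputed data sets are usually pooled with Rubin's rules: the total variance adds a between-imputation component to the average within-imputation variance. The function name `pool_rubin` and the numbers are made up for illustration; in practice each estimate and its variance would come from a weighted analysis of one imputed copy of A'.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)    # pooled point estimate
    w_bar = mean(variances)    # average within-imputation variance
    b = variance(estimates)    # between-imputation variance (divisor m - 1)
    total_var = w_bar + (1 + 1 / m) * b
    return q_bar, total_var

# Hypothetical results from m = 3 imputed data sets.
pooled_estimate, pooled_variance = pool_rubin(
    estimates=[2.0, 2.5, 3.0],
    variances=[0.1, 0.1, 0.1],
)
```

With a single imputation the between-imputation term cannot be estimated, which is why the reported variance would be too small.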

Re-weighting the four-survey data set: You don't say how you weighted the final data set to account for losses of subjects after survey 1. The general procedure is another application of inverse probability weighting, this time to the survey 1 weights, say w1: estimate, from the survey 1 data, the probability p_all of being in all four surveys. The covariates would be the design variables and other characteristics known at survey 1. Then w* = w1/p_all. Try a Google or Google Scholar search on "panel weighting" and "longitudinal weighting" for applications specific to this area.
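Putting the two adjustments together, the weight for a household that stays in all four surveys and is successfully matched is w1/(p_all × p_mi). A minimal sketch of that arithmetic, with a hypothetical helper `final_weight` and made-up probabilities:

```python
def final_weight(w1, p_all, p_mi):
    """Chain the two inverse-probability adjustments: w* = w1/p_all, then w*/p_mi."""
    if not (0 < p_all <= 1 and 0 < p_mi <= 1):
        raise ValueError("p_all and p_mi must be probabilities in (0, 1]")
    w_star = w1 / p_all      # adjusts for attrition after survey 1
    return w_star / p_mi     # adjusts for failed matches to dataset B

# Hypothetical household: survey 1 weight 120, 40% chance of completing
# all four surveys, 80% chance of a successful match.
w = final_weight(w1=120.0, p_all=0.4, p_mi=0.8)
```

Both probabilities would be estimated from models (or weighting cells) as described above; very small values produce very large weights, so it is worth inspecting the distribution of the final weights.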
