I have dataset A, where households are interviewed each quarter for one year. I am focusing only on the households that have data for all four quarters; with this restriction, the number of observations falls by more than half.

Next, I have dataset B, which contains a variable "notinA" that is not in dataset A. I am trying to impute this variable into dataset A using demographic variables available in both datasets and nearest-neighbor matching. In this process, not all of the observations in dataset B are used, and some are used more than once as donors when imputing "notinA" in A from B. Moreover, there are some observations in dataset A for which no match is possible; these are dropped.

So my question is whether it is justifiable to use the sample weights (pweights) of dataset A for the final data set (with the variable imputed from dataset B). Can you please point me to literature that suggests how to handle sample weights in an example like this?
Solved – Using the sample weights in regression
sampling, weighted-sampling
Related Solutions
Update 2014-04-04: create reasonably sized weights from the logarithms; see below.
I don't see that a logarithmic approach is needed. To deal with very large weights, divide each by a large constant, e.g. $10^3$ or $10^4$, which will be a simple matter of moving the decimal point. Then apply the standard formulas for weighted means; these are invariant to changes in the weights of the form $w'=C\thinspace w$, because the constant $C$ cancels out in numerator and denominator. Similar remarks apply to weighted estimates of variance.
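A minimal numpy sketch of this invariance, with invented weights and outcomes (the constant $C$ is arbitrary):

```python
import numpy as np

# Hypothetical very large survey weights and outcomes.
w = np.array([2.4e7, 1.1e8, 5.6e7, 8.9e7])
y = np.array([3.2, 4.1, 2.8, 3.7])

C = 1e4  # any positive constant
w_rescaled = w / C

# The weighted mean is invariant to rescaling: C cancels
# in the numerator and the denominator.
mean_original = np.sum(w * y) / np.sum(w)
mean_rescaled = np.sum(w_rescaled * y) / np.sum(w_rescaled)
assert np.isclose(mean_original, mean_rescaled)
```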
Update: Get revised weight $w'$ from logs
If $C$ is, e.g., $10^3$ or $10^4$, and $\log(w)$ is the log weight, then
$$ w' = \frac{w}{C} = \exp(\log(w)-\log(C)) $$ which for $C = 10^4$ would be
$$ w' = \exp(\log(w)-4\log(10)) $$
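If the weights are only available (or only representable) on the log scale, the same rescaled weights can be recovered without ever forming the raw $w$; a short sketch with made-up log weights:

```python
import numpy as np

log_w = np.array([18.6, 17.9, 19.2])  # hypothetical log weights
C = 1e4

# w' = exp(log(w) - log(C)) = w / C, computed directly from the logs.
w_prime = np.exp(log_w - np.log(C))
```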
Note that the sum in your last expression
$$ \sum_i \exp\left(\ln a_i - \ln a_0\right) $$
is equivalent to writing
$$ \sum_i \left(\frac{a_i}{a_0}\right) $$
This is just a standardization of each $a_i, i\gt 0$, by the first term. Thus your proposal cannot escape summing the weights.
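Numerically the two forms are identical, so the log version buys nothing; a quick check with invented weights:

```python
import numpy as np

a = np.array([1.5e8, 2.3e8, 9.7e7, 4.4e8])  # hypothetical weights
a0 = a[0]

via_logs = np.sum(np.exp(np.log(a) - np.log(a0)))
via_ratio = np.sum(a / a0)
assert np.isclose(via_logs, via_ratio)  # same sum of standardized weights
```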
I think the question you should focus on is the population you want to make inference about. The sample weights are for making estimates of population totals $\hat{Y}=\sum_i w_i y_i$. The weights are for getting from the sample to inference about a specific population. What you should think about is whether that is your population of interest.
So, if you wanted to fit, say, a linear regression model, you would need all the population sums, sums of squares, and sums of cross-products over the population. The survey weights give you estimates of these quantities.
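A sketch of that idea with made-up data: the weighted normal equations use survey-weighted sums of squares and cross-products as estimates of their population counterparts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
w = rng.uniform(0.5, 3.0, size=n)  # hypothetical survey weights

X = np.column_stack([np.ones(n), x])

# Weighted normal equations: (X'WX) beta = X'Wy, where X'WX and X'Wy
# estimate the population sums of squares and cross-products.
XtWX = X.T @ (w[:, None] * X)
XtWy = X.T @ (w * y)
beta_hat = np.linalg.solve(XtWX, XtWy)
```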
You could even say that the "population log likelihood" is a sum as we have $$L_p(\theta) = \sum_i \log\left(f(y_i|\theta)\right)$$
for some log likelihood $\log\left(f(y_i|\theta)\right)$. Using the sample weights essentially provides an estimate of this quantity: substitute $\log\left(f(y_i|\theta)\right)$ for $y_i$ in the previous expression for $\hat{Y}$.
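For instance, for a normal model with known variance, the survey-weighted log likelihood is the estimate $\hat{L}_p(\theta) = \sum_i w_i \log f(y_i \mid \theta)$; a small sketch with invented data:

```python
import numpy as np
from scipy.stats import norm

y = np.array([3.2, 4.1, 2.8, 3.7, 5.0])
w = np.array([120.0, 80.0, 200.0, 150.0, 90.0])  # hypothetical weights

def weighted_loglik(theta, y, w, sigma=1.0):
    # Estimate of the population log likelihood: sum_i w_i * log f(y_i | theta).
    return np.sum(w * norm.logpdf(y, loc=theta, scale=sigma))

# Maximizing over theta gives the pseudo-MLE; for this model it is
# simply the weighted mean.
theta_hat = np.sum(w * y) / np.sum(w)
```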
However, you are likely to run into problems with standard variance estimation in most modelling programs: the standard errors will be far too small. Conceptually this makes sense if we think of them as estimates of the standard errors we would get from fitting the model to census data; that is, we expect $L_p(\theta)$ to be quite sharply peaked. But the problem is that we are using the estimate $\hat{L}_p(\theta)$, and this estimate has error that needs to be accounted for. Usually jackknife/bootstrap weights are provided with these kinds of files, and using the variation across them gives you a more reasonable estimate of uncertainty.
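A hedged sketch of the replicate-weight idea, assuming the file ships a matrix of delete-one jackknife replicate weights rep_w (one column per replicate) and using the standard JK1 variance formula:

```python
import numpy as np

def weighted_mean(y, w):
    return np.sum(w * y) / np.sum(w)

# y: outcomes; w: full-sample weights; rep_w: (n, R) replicate weights,
# all assumed to come with the survey file.
def jackknife_variance(y, w, rep_w):
    full = weighted_mean(y, w)
    reps = np.array([weighted_mean(y, rep_w[:, r])
                     for r in range(rep_w.shape[1])])
    R = len(reps)
    # JK1 variance: ((R - 1) / R) * sum of squared deviations
    # of the replicate estimates from the full-sample estimate.
    return (R - 1) / R * np.sum((reps - full) ** 2)
```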
This is also a place where your "Bayes, model-based" / "frequentist, design-based" philosophy matters somewhat, because your variance estimates depend on what you are conditioning on as fixed, i.e., does the error come from "the sample selected" or from "the predictive model"?
It is also not a bad idea simply to check whether using the weights makes a difference to your analysis, noting that you should expect a difference in accuracy measures but not necessarily in the parameter estimates (such as regression coefficients).
Best Answer
The answer is "No": you cannot use the A weights as-is.
You've imputed your new variable to a subset of dataset A; call it A'. Now A' does not represent the original A sample, so you must do what is known as "inverse probability weighting". A Google search will turn up many references. Briefly: estimate the probability of successful imputation; call it p_mi. If the study weight is w*, then the final weight will be w*/p_mi. (Technically, you've re-weighted.) The p_mi model can be based on any characteristics known for members of A, not just the demographic variables they share with B. The predictors can also include design characteristics, such as the initial study weights, strata, and stratum characteristics.
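A hedged sketch of that re-weighting step, assuming a pandas DataFrame dfA with hypothetical covariates, the original weight column w_star, and an indicator imputed (1 if a donor in B was found); the logistic model for p_mi is illustrative, not the only choice:

```python
import statsmodels.api as sm

# dfA: dataset A with hypothetical columns 'age', 'hhsize', 'w_star'
# (study weight) and 'imputed' (1 if a match in B was found, else 0).
def ipw_adjusted_weights(dfA):
    X = sm.add_constant(dfA[["age", "hhsize"]])
    # Model the probability of successful imputation, p_mi.
    fit = sm.Logit(dfA["imputed"], X).fit(disp=0)
    p_mi = fit.predict(X)
    # Final weight for the matched subset A': w* / p_mi.
    out = dfA.loc[dfA["imputed"] == 1].copy()
    out["w_final"] = out["w_star"] / p_mi[dfA["imputed"] == 1]
    return out
```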
A couple of other issues:
1. If you are not using multiple imputation (MI), do so; otherwise analyses with the new variable will not account for the uncertainty introduced by imputing "notinA" (see the combining sketch after this list). Edit: note that an imputation model is one that predicts "notinA" in the B population.

2. Re-weighting the four-survey data set: you don't say how you weighted the final data set to account for losses of subjects after survey 1. The general procedure is another application of inverse probability weighting, applied to the survey 1 weights, say w1: estimate, from the survey 1 data, the probability p_all of being in all four surveys. The covariates would be the design variables and other characteristics known at survey 1. Then w* = w1/p_all. Try a Google, or Google Scholar, search on "panel weighting" and "longitudinal weighting" for applications specific to this area.
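A minimal sketch of the MI combining step (Rubin's rules), assuming you already have a point estimate and its within-imputation variance from each of M separately imputed data sets; the numbers are invented:

```python
import numpy as np

# Point estimates and within-imputation variances from M = 5
# analyses of separately imputed data sets (hypothetical values).
est = np.array([0.52, 0.48, 0.55, 0.50, 0.47])
var_within = np.array([0.010, 0.012, 0.011, 0.009, 0.013])

M = len(est)
q_bar = est.mean()                   # combined point estimate
u_bar = var_within.mean()            # average within-imputation variance
b = est.var(ddof=1)                  # between-imputation variance
total_var = u_bar + (1 + 1 / M) * b  # Rubin's total variance
se = np.sqrt(total_var)
```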