Solved – R: Do I have to use sample weights for calculations inside a bootstrap function that already uses sample weights?

bootstrap, r, survey-weights

I am using the boot function in R to get standard errors for several statistics (I am doing an Oaxaca-Blinder decomposition). My data (EU-SILC) have a sample weight (PB040) for every observation. My understanding is that I have to use those weights in every OLS regression or other procedure I run, to correct for the sampling of the survey.

The complication is that the boot package in R can also take sample weights.

Should I use the same sample weights for the call of the boot function and for the calculations "inside" the boot function?

My reason for asking is that, since the boot function does a weighted draw from the original sample, the function "inside" the boot function should already receive a weighted sample. Using the sample weights again inside that function would then be redundant or would introduce a bias.

This is my first post on Cross Validated, and I am pretty new to bootstrapping and econometrics, so I hope I have asked the question clearly enough.

Best Answer

If you are working with a typical labor force survey (LFS), chances are it has more than just the weights in the mix, such as stratification and clustering. These have to be accounted for appropriately. R's survey package can create bootstrap weights that account for these aspects of the survey; I doubt the boot package does this properly. That said, you may have to code the Oaxaca-Blinder decomposition from scratch using the results from svyglm.
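As a rough illustration of that route, here is a minimal sketch using the survey package on simulated data. Everything about the data is an assumption made for the example: the variable names (`stratum`, `psu`, `w`, `wage`, `educ`, `female`) are hypothetical and are not EU-SILC variables, and the helper `ob_twofold` is invented here. It also computes a simple twofold decomposition with weighted `lm` fits inside `withReplicates`, rather than working from `svyglm` output, because that keeps the replicate loop short; treat it as one possible way to do it, not as the definitive implementation.

```r
library(survey)

# Simulated stand-in for a survey extract (all names are hypothetical).
set.seed(1)
n <- 400
d <- data.frame(
  stratum = rep(1:4, each = 100),   # 4 strata
  psu     = rep(1:40, each = 10),   # 10 clusters (PSUs) per stratum
  w       = runif(n, 50, 150),      # sampling weight, the role PB040 plays
  educ    = rnorm(n, 12, 2),
  female  = rbinom(n, 1, 0.5)
)
d$wage <- 1 + 0.08 * d$educ - 0.15 * d$female + rnorm(n, sd = 0.3)

# Declare clusters, strata and weights once, in the design object.
des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~w, data = d)

# Bootstrap replicate weights: PSUs are resampled within strata and the
# sampling weights are rescaled accordingly in each replicate.
rep_des <- as.svrepdesign(des, type = "bootstrap", replicates = 200)

# Twofold decomposition of the mean wage gap, with the male coefficients
# as the reference. 'repw' is whichever weight vector withReplicates()
# passes in: the sampling weights for the point estimate, then each set
# of replicate weights in turn.
ob_twofold <- function(repw, data) {
  data$.rw <- repw                      # avoid a name clash with column 'w'
  dm <- data[data$female == 0, ]
  df <- data[data$female == 1, ]
  bm <- coef(lm(wage ~ educ, data = dm, weights = .rw))
  bf <- coef(lm(wage ~ educ, data = df, weights = .rw))
  xm <- c(1, weighted.mean(dm$educ, dm$.rw))
  xf <- c(1, weighted.mean(df$educ, df$.rw))
  c(explained   = sum((xm - xf) * bm),
    unexplained = sum(xf * (bm - bf)))
}

res <- withReplicates(rep_des, ob_twofold)
print(res)   # point estimates with replicate-based standard errors
```

Note where the weights enter: the statistic is always computed with weights, and it is the weights themselves that change from replicate to replicate. There is no separate weighted resampling step followed by an unweighted (or doubly weighted) calculation.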

To my amazement and despair, econometric references often give technically incorrect advice on how to deal with complex survey data. It seems that few if any econometricians have ever taken a course in sampling, or understand finite population inference and why survey statisticians create all these complications of weights, clusters and strata. (The answer is, to make the damn thing work at all; it is all very nice to assume rational expectations for a dissertation paper, but in the real world you have to deal with limited budgets, non-existent lists of observation units, and requests to optimize the survey for this and that and yet another statistic.) (To be fair, it takes a complete overhaul of one's thinking: econometricians are all about models, while finite population inference is all about avoiding models and relying on non-parametric, design-based inference. It took me some 2-3 years to get into the groove; I started as an econometrician, and now do surveys for a living.) For a 50-page introduction, you can take a look at my chapter in the Handbook of Health Survey Methods. For a book-length treatment, you would want to start with Heeringa, West and Berglund (2010) or Lumley (2010).

As a side comment, there may be better decompositions available on the market.
