Solved – Using post-stratification weights in R survey package

r, stratification, survey, survey-sampling, survey-weights

I am analyzing a dataset that has a variable for post-stratification weights. As this is a complex survey, the plan is to use the R survey package. I have been reading its documentation and feel able to set up a survey design correctly. So far, so good. That said, one aspect is still not clear to me.

Lumley says that survey assumes weights are sampling weights, i.e. 1/(probability of selection for that observation):

Survey designs are specified using the svydesign function. The main arguments to the function are id to specify sampling units (PSUs and optionally later stages), strata to specify strata, weights to specify sampling weights, and fpc to specify finite population size corrections. These arguments should be given as formulas, referring to columns in a data frame given as the data argument. (http://r-survey.r-forge.r-project.org/survey/example-design.html)

My dataset does not include a variable for sampling weights. Its only weight is a post-stratification weight that accounts for probability of selection and unit non-response, and post-stratifies the sample to match the joint age and gender distribution. The post-stratification weight is rescaled to the sample size: there are 1,000 observations, so sum(poststratification.wt) = 1,000, with weights ranging from ~0.9 to ~5.5. I have closely inspected the data, and the information available does not allow me to estimate the probability weights from scratch.

So my question is: am I safe, or roughly safe, using the provided post-stratification weight in the svydesign(weights=) argument? If not, what should I do? (Running a new survey of 1,000 respondents is beyond my budget, hehe.)

Best Answer

If people say they have post-stratified weights, it does not necessarily mean they implemented post-stratification proper (that is, rescaled the weights in each demographic cell to the known population total). About 80% of the usage of "post-stratified weights" that I hear actually refers to calibrated weights (i.e., rather than trying to adjust each and every cell in a five-way table, the weights are only adjusted to match each of the five variables of the table individually). I produced what somebody referred to as a methodological rant on the distinction.

The distinction matters for standard error calculations, as Anthony noted in another answer. With properly post-stratified weights, you can apply the regular variance estimation formulae, more or less treating your post-strata as sampling strata (minor technicalities aside). With weights that are only calibrated on each table margin, the computations are somewhat more involved. Both procedures are implemented in the survey package, though. You just need to feed your post-stratification/calibration variables to the appropriate design object/formula.

library(survey)
data(api)
# cross-classified post-stratification variable in population
apipop$stype.sch.wide <-
  10*as.integer(apipop$stype) + as.integer(apipop$sch.wide)
# cross-classified post-stratification variable in sample
apiclus1$stype.sch.wide <- 
  10*as.integer(apiclus1$stype) + as.integer(apiclus1$sch.wide)
# population totals
(pop.totals <- xtabs(~stype.sch.wide, data=apipop))
# reference design
dclus1 <- svydesign(id=~dnum,weights=~pw,data=apiclus1,fpc=~fpc)
# post-stratification of the original design
dclus1p <- postStratify(dclus1,~stype.sch.wide, pop.totals)
# design with post-stratified weights, but no evidence of post-stratification
dclus1pfake <- svydesign(id=~dnum,weights=~weights(dclus1p),data=apiclus1,fpc=~fpc)
# starting from the design with known weights, add the post-stratification variable
dclus1pp <- postStratify(dclus1pfake,~stype.sch.wide, pop.totals)

# estimates and standard errors: starting point
svymean(~api00,dclus1)
# post-stratification reduces standard errors a bit
svymean(~api00,dclus1p)
# but here we are not aware of the survey being post-stratified
svymean(~api00,dclus1pfake)
# if we just add post-stratification variables to the design object
# that only had post-stratified weights, the result is the same
# as for post-stratified object based on the original weights
svymean(~api00,dclus1pp)
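
For the other case discussed above, weights calibrated only on each margin rather than on the full cross-classification, here is a minimal sketch using rake() from the survey package on the same api data. The object names pop.stype, pop.sch.wide and dclus1r are made up for this illustration, and the population margins are simply tabulated from apipop; with real data you would plug in the known population margins for your calibration variables.

```r
library(survey)
data(api)

# reference design, redefined here so the sketch runs on its own
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# population margins, one table per variable (no cross-classification)
pop.stype    <- xtabs(~stype, data = apipop)
pop.sch.wide <- xtabs(~sch.wide, data = apipop)

# rake the weights so that each margin matches its population total
dclus1r <- rake(dclus1,
                sample.margins     = list(~stype, ~sch.wide),
                population.margins = list(pop.stype, pop.sch.wide))

# weighted margins should now be close to the population margins
svytotal(~stype, dclus1r)
svytotal(~sch.wide, dclus1r)

# mean whose standard error takes the raking into account
svymean(~api00, dclus1r)
```

The standard error from dclus1r can be compared with those from dclus1 and dclus1p above to see how much each adjustment changes the picture for a given variable.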