Solved – Why it is important to create the survey design object (svydesign in R's "survey" package, with id, strata, weights, fpc) from the raw data first and clean the data within that object afterwards

r, survey, survey-sampling, survey-weights

I am planning to analyse a survey. I have been told that if I clean the data first (e.g. subsetting the data, recoding values, creating new variables from existing ones) and only afterwards create the survey design object (with the svydesign function in the "survey" package for R, specifying id, strata, weights and fpc), I may get incorrect point estimates and confidence intervals. I have been advised to create the survey design object first and then clean the data within that object. Can you please explain why this is necessary?

Best Answer

There are two separate issues here.

Sometimes, including with NHANES data, you do need to subset before defining the survey design object, because not all the records in the data set are part of the sample you are analysing. In NHANES, everyone in the data file will have a health questionnaire, but only a subset will have a clinical examination, and there may be smaller subsets with specific biochemical measurements. You need to remove records from the file that are not part of the sample you are analysing.

For example, I might use something like

# keep only records that belong to the MEC (examination) sample
nhanesmec <- subset(nhanes, !is.na(WTMEC2YR))

to analyse data from the clinical examination. Records with missing WTMEC2YR are not part of the MEC sample and so should not go into the survey design object.

On the other hand, if you have observations that are part of the sample, you should not remove them even if they have missing or implausible data, and you should not, e.g., remove the records for men when you only want to analyse women. Instead, build the design object first and subset that, as in the sketch below.
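As a sketch of the recommended order (the variable names here are illustrative, following the usual NHANES conventions: SDMVPSU is the pseudo-PSU, SDMVSTRA the pseudo-stratum, WTMEC2YR the MEC weight, RIAGENDR the sex variable coded 2 for women, BMXBMI the body mass index; check your own file before copying):

library(survey)

# build the design on the full MEC sample (nhanesmec from above);
# nest = TRUE because NHANES PSU codes repeat across strata
nhanes_design <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
                           weights = ~WTMEC2YR, nest = TRUE,
                           data = nhanesmec)

# restrict to women by subsetting the design object, which in effect
# zeroes out the weights of the other records instead of dropping PSUs
women_design <- subset(nhanes_design, RIAGENDR == 2)
svymean(~BMXBMI, women_design, na.rm = TRUE)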

The reason is complicated, and makes almost no difference for NHANES. However, you asked.

Let's ignore stratified sampling for now and just consider cluster sampling. The survey was designed to sample a specific, preplanned number of clusters. When we think about how different the results could be under a hypothetical replication of the survey [the frequentist definition of sampling uncertainty], we want to consider hypothetical replications that have the same preplanned number of clusters.

If subsetting the data removes every observation from one of the clusters, you no longer have the preplanned number of clusters. The number of clusters is now random, and you would have to model the extra variability that this randomness introduces.

The computations that give the correct variances are equivalent to setting the weight to zero for an observation you want to omit, rather than actually dropping it; the number of clusters stays the same. If you look at the output of summary on a subsetted survey design object, you can see that the object keeps track of how many clusters (PSUs) it currently has data for (which is random) and how many it started with (which is fixed). Using Anthony's example

library(survey)

data(api)

# build the design object from the full stratified sample, then subset it
dstrat_after <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                          data = apistrat, fpc = ~fpc)
dstrat_after <- subset(dstrat_after, comp.imp == 'Yes')
summary(dstrat_after)

you will see as part of the output

Stratum Sizes: 
             E  H  M
obs         75 17 24
design.PSU 100 50 50
actual.PSU  75 17 24

If you subset before setting up the survey design object, the object has no way of knowing the planned number of observations or clusters, and therefore no way of getting the right standard errors. You can see the difference by building the design from pre-subsetted data, as below.
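For contrast, a minimal sketch of the problematic order, continuing Anthony's example: subsetting the data frame first makes the object treat the subset as if it were the full planned sample, so the reported design.PSU matches the subset counts and the variance calculation proceeds as if no units had been lost.

# problematic order: subset the data frame, then build the design;
# the object now believes 75/17/24 was the planned design
apistrat_sub <- subset(apistrat, comp.imp == 'Yes')
dstrat_before <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                           data = apistrat_sub, fpc = ~fpc)
summary(dstrat_before)  # design.PSU now matches the subset counts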

And finally: this only makes a difference when your subset has fewer primary sampling units than the full sample. In Anthony's example the PSUs are individual records, so the subset does have fewer. In NHANES the PSUs are cities or counties, so you would have to remove a lot of observations before you lost a whole PSU. Also, since the NHANES design has only two PSUs per stratum, losing a PSU would create other problems with estimating standard errors, and you would have to look up survey.lonely.psu (see the sketch below).
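For reference, that handling is controlled by a package option; a minimal sketch (see ?survey.lonely.psu for the full set of choices):

# tell the survey package how to handle single-PSU ("lonely") strata
# instead of stopping with an error; "adjust" centres the stratum's
# contribution at the grand mean
options(survey.lonely.psu = "adjust")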

So, for NHANES it's unlikely to actually matter.