I just got my hands on the ANES (American National Election Studies) 2008 data set, and would like to do some simple analysis in R. However, I've never worked with a data set this complex before and I've run into an issue.
The survey uses oversampling and has a variable for post-stratification weights. I had only the vaguest idea of what that meant, so I read the Wikipedia page on it, which I understand conceptually. Unfortunately, I don't know how to set things up in R so that the post-stratification weights are reflected when I do my analysis.
While the idea of oversampling didn't confuse me conceptually, the following documentation for the R "survey" package is completely unintelligible to me. I'll show what I've found so far, and I would really appreciate either an explanation of what's going on with these methods, or, if anyone knows a simpler way to apply a post-stratification weight to a data frame of variables, I'd love to hear that too.
So, I found the "survey" package from CRAN, and I have the manual, and, after looking through it, it seems that the most promising method is:
postStratify(design, strata, population, partial = FALSE, ...)
However, when I look at the documentation for what needs to be passed for each of these arguments, I'm completely lost. They are as follows:
design: A survey design with replicate weights
strata: A formula or data frame of post-stratifying variables
population: A table, xtabs or data.frame with population frequencies
partial: if TRUE, ignore population strata not present in the sample
None of these make a lot of sense to me, but I'm pretty sure that the design argument is supposed to be of a class also defined in this package:
svydesign(ids, probs=NULL, strata = NULL, variables = NULL, fpc=NULL,
data = NULL, nest = FALSE,
check.strata = !nest, weights=NULL,pps=FALSE,...)
If you notice, there are a ton of optional arguments here, which all seem to do similar types of things (at least to me, after reading the docs…).
I'm basically at a loss for why this is so complicated in R. Am I misunderstanding things? Is there a simpler way to do this? Any help would be appreciated.
Best Answer
Looking at the example for postStratify in the manual, you are correct: you seem to be required to give a svydesign object (though you can if needed use svrepdesign to specify it instead).

The svydesign object must have ids; all the others are optional, though you will almost certainly want data to have something to work with, and you will probably want some of the others. At this stage I would suggest you ignore all those appearing after data.

postStratify also needs strata, the variable to post-stratify on: the example uses apiclus1$stype, which simply specifies the school type (E, M or H). It also needs population, which you can either specify yourself or take from some other source: the example gives data.frame(stype=c("E","H","M"), Freq=c(4421,755,1018)) though, as you say, table or xtabs can be used instead.

Again, you can then ignore all the other options unless you know you need them, so you can end up with something as simple as the example's dclus1p<-postStratify(dclus1, ~stype, pop.types).
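
To make the steps above concrete, here is a minimal sketch adapted from the manual's postStratify example. It assumes the survey package is installed and uses the api data set bundled with it, not the ANES file. The column names dnum, pw and fpc belong to that bundled data; the final svymean call on api00 is only there to illustrate what you might do with the result.

```r
library(survey)

# Example data shipped with the survey package (not ANES)
data(api)

# Step 1: describe the sample. ids is the only required argument;
# weights, fpc and data are the optional ones this example happens to need.
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Step 2: known population counts for each level of the post-stratifying variable
pop.types <- data.frame(stype = c("E", "H", "M"),
                        Freq  = c(4421, 755, 1018))

# Step 3: post-stratify on school type
dclus1p <- postStratify(dclus1, ~stype, pop.types)

# After post-stratification the weights within each school type
# sum to the population counts given in pop.types
print(tapply(weights(dclus1p), apiclus1$stype, sum))

# Any analysis is then run on the design object rather than the raw data frame
print(svymean(~api00, dclus1p))
```

If your file already ships a finished post-stratification weight column, it may be enough to pass that column through the weights argument of svydesign and skip postStratify altogether. I have not checked this against the ANES documentation, so treat it as an assumption to verify there.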