Solved – R : using survey package to run t-test on sub population of weighted data set

I have a large set of weighted data. I have loaded it into a survey design and would now like to run t-tests on sub-populations.

example:

DF<-cbind(ID, WEIGHT, GENDER, RELOCATE, INCOME)

     ID WEIGHT GENDER RELOCATE INCOME
[1,]  1   4380      1        1     35
[2,]  2   5000      1        1     20
[3,]  3      0      0        1     55
[4,]  4   5640      1        0     60
[5,]  5   6120      0        1     25

example.survey<-svydesign(ids=~0, data=DF, weights=WEIGHT)

I am able to call the mean income for the entire sample by:

svymean(INCOME, example.survey)

       mean    SE
[1,] 35.227 9.043
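
As a sanity check, the point estimate above is just the weighted mean of INCOME, which can be reproduced in base R. This is a minimal sketch that re-enters the five example rows by hand; it checks only the mean, not the standard error, which depends on the design.

```r
# Example rows re-entered by hand from the table above
WEIGHT <- c(4380, 5000, 0, 5640, 6120)
INCOME <- c(35, 20, 55, 60, 25)

# Weighted mean: sum(w * x) / sum(w)
wm <- weighted.mean(INCOME, WEIGHT)
round(wm, 3)  # 35.227
```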

However, I want to compare the means for a subpopulation of this sample so that I can maintain the proper weights.

Can you confirm that this is the proper syntax to run a t-test comparing the mean INCOME based on GENDER for those who relocated (RELOCATE==1)?

svyttest(INCOME~GENDER+RELOCATE==1, example.survey)

data:  INCOME ~ GENDER + RELOCATE == 1
t = 0.9841, df = 2, p-value = 0.4288
alternative hypothesis: true difference in mean is not equal to 0
sample estimates:
difference in mean 
          14.78145

Best Answer

You want to subset the design object rather than put the condition in the formula:

svyttest( INCOME ~ GENDER , subset( example.survey , RELOCATE == 1 ) )
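
For a version that can be pasted into a fresh session, here is a minimal sketch using the five example rows from the question. The data frame construction, `ids = ~1`, and the formula form `weights = ~WEIGHT` are assumptions about how the design would usually be declared, not the asker's exact code. Subsetting the design object, rather than the raw data, lets the survey package keep the design information it needs for the variance of the subpopulation estimate.

```r
library(survey)

# Example rows from the question, entered as a data frame (assumed layout)
DF <- data.frame(
  ID       = 1:5,
  WEIGHT   = c(4380, 5000, 0, 5640, 6120),
  GENDER   = c(1, 1, 0, 1, 0),
  RELOCATE = c(1, 1, 1, 0, 1),
  INCOME   = c(35, 20, 55, 60, 25)
)

# No clustering (ids = ~1); weights given as a formula referring to a column
example.survey <- svydesign(ids = ~1, data = DF, weights = ~WEIGHT)

# Restrict the design to those who relocated, then compare INCOME by GENDER
relocated <- subset(example.survey, RELOCATE == 1)
tt <- svyttest(INCOME ~ GENDER, relocated)
tt
```

With so few rows the test itself is not meaningful; the point is only the shape of the call.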

Related Solutions

Solved – Using post-stratification weights in R survey package

If people say they have post-stratified weights, it does not necessarily mean they implemented proper post-stratification (that is, rescaled the weights in each demographic cell to the known population total). About 80% of the usage of "post-stratified weights" that I hear actually refers to calibrated weights: rather than trying to adjust every cell in a five-way table, the weights are adjusted only to match each of the five variables of the table individually. I produced what somebody referred to as a methodological rant on the distinction.

The distinction, however, plays a role in standard error calculations, as Anthony noted in another answer. With properly post-stratified weights, you can apply the regular variance estimation formulae, more or less treating your post-strata as sampling strata (minor technicalities aside). With weights that are only calibrated on each table margin, computations are somewhat more involved. Both procedures are implemented in the survey package, though. You just need to feed your post-stratification/calibration variables to the appropriate design object/formula.

library(survey)
data(api)
# cross-classified post-stratification variable in population
apipop$stype.sch.wide <-
  10*as.integer(apipop$stype) + as.integer(apipop$sch.wide)
# cross-classified post-stratification variable in sample
apiclus1$stype.sch.wide <- 
  10*as.integer(apiclus1$stype) + as.integer(apiclus1$sch.wide)
# population totals
(pop.totals <- xtabs(~stype.sch.wide, data=apipop))
# reference design
dclus1 <- svydesign(id=~dnum,weights=~pw,data=apiclus1,fpc=~fpc)
# post-stratification of the original design
dclus1p <- postStratify(dclus1,~stype.sch.wide, pop.totals)
# design with post-stratified weights, but no evidence of post-stratification
dclus1pfake <- svydesign(id=~dnum,weights=~weights(dclus1p),data=apiclus1,fpc=~fpc)
# starting from the design that has only the weights, add the post-stratification variable
dclus1pp <- postStratify(dclus1pfake,~stype.sch.wide, pop.totals)

# estimates and standard errors: starting point
svymean(~api00,dclus1)
# post-stratification reduces standard errors a bit
svymean(~api00,dclus1p)
# but here we are not aware of the survey being post-stratified
svymean(~api00,dclus1pfake)
# if we just add post-stratification variables to the design object
# that only had post-stratified weights, the result is the same
# as for post-stratified object based on the original weights
svymean(~api00,dclus1pp)
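
The other case described above, weights calibrated to each margin separately rather than to the full cross-classification, can be sketched with `rake()` from the same package. This is an illustrative sketch on the same `api` data; the object names `pop.stype`, `pop.schwide` and `dclus1r` are made up for the example.

```r
library(survey)
data(api)

# Reference cluster design, as in the post-stratification example
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Population totals for each margin separately (not the cross-classification)
pop.stype   <- xtabs(~stype, data = apipop)
pop.schwide <- xtabs(~sch.wide, data = apipop)

# Rake (iterative proportional fitting) the weights to the two margins
dclus1r <- rake(dclus1,
                sample.margins = list(~stype, ~sch.wide),
                population.margins = list(pop.stype, pop.schwide))

# Weighted sample totals for a raked margin, next to the population totals
svytotal(~sch.wide, dclus1r)
pop.schwide

# Mean and standard error under the raked design
svymean(~api00, dclus1r)
```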

Solved – R survey package: finite population correction affects point estimate in addition to the variance estimate

Yes, they will give different estimates. ?svydesign says: "If population sizes are specified but not sampling probabilities or weights, the sampling probabilities will be computed from the population sizes assuming simple random sampling within strata."

Looking inside survey:::svydesign.default:

if (is.null(probs) && is.null(weights)) {
    if (is.null(fpc$popsize)) {
        if (missing(probs) && missing(weights)) 
            warning("No weights or probabilities supplied, assuming equal probability")
        probs <- rep(1, nrow(ids))
    }
    else {
        probs <- 1/weights(fpc, final = FALSE)
    }
}

So if the user does not specify weights but does specify the fpc, the stratum-level fpc is used to compute the weights, which affects point estimates as well as variance calculations.

library(survey)
data(api)

dstrat1<-svydesign(id=~1,strata=~stype, data=apistrat, fpc=~fpc)
dstrat2<-svydesign(id=~1,strata=~stype, data=apistrat)

svymean( ~ api00 , dstrat1 )
svymean( ~ api00 , dstrat2 )
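
To see where the difference comes from, one can compare the weights the two designs end up with. This is a minimal sketch on the same `apistrat` data; `w1`, `w2`, `m1` and `m2` are names introduced here for illustration, and the warning from the second `svydesign()` call is suppressed only to keep the output short.

```r
library(survey)
data(api)

# fpc but no weights: weights are derived as N_h / n_h within each stratum
dstrat1 <- svydesign(id = ~1, strata = ~stype, data = apistrat, fpc = ~fpc)
# neither fpc nor weights: equal weights are assumed (svydesign warns about this)
dstrat2 <- suppressWarnings(
  svydesign(id = ~1, strata = ~stype, data = apistrat)
)

# Weights implied by each design
w1 <- weights(dstrat1)
w2 <- weights(dstrat2)
tapply(w1, apistrat$stype, unique)  # one weight per stratum
unique(w2)                          # all equal

# The point estimates are weighted means under the respective weights
m1 <- coef(svymean(~api00, dstrat1))
m2 <- coef(svymean(~api00, dstrat2))
c(with_fpc = unname(m1), without_fpc = unname(m2))
```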
