R Regression – How to Run svy: Regress or Get R-Squared for Complex Survey Data

Tags: r, regression, stata, survey

I am trying to get r-squared, or explained variation, for complex survey data using a linear regression (OLS).

In Stata, this can be done with svy: regress. In R, however, the 'survey' package has no dedicated function for OLS linear regression. There is svyglm, which fits a generalized linear model (GLM), but it does not report a value for explained variation (r-squared) because it isn't OLS. Is there a way to get r-squared for complex survey data in R?

library(survey)

design <- svydesign(id = ~psu, strata = ~strata, weights = ~w_mec, nest = TRUE, data = sample)

# svyglm takes its data from the design object, so no separate data argument is needed
model1 <- svyglm(bmi ~ 1 + age + black + hispanics + others + female + edu2 + edu3 + edu4 + near_poor + middle + high, design = design, family = gaussian(link = "identity"))

summary(model1)

Above is an example of what I did in R. It doesn't give r-squared because it's a GLM. You don't really need to reproduce anything; this isn't a code issue, I just want to know if there is a way to get r-squared for complex survey data in R.

Best Answer

For a Gaussian GLM (where the population parameter is the OLS parameter), you can just divide the dispersion parameter by the population variance of the outcome and subtract the result from 1.
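Written as a formula, in notation that is an assumption for illustration rather than anything the survey package defines: let \hat{\phi} be the dispersion parameter reported for the fitted Gaussian model and \hat{\sigma}_y^2 the design-based estimate of the population variance of the outcome. The calculation described above is then roughly:

```latex
R^2 \approx 1 - \frac{\hat{\phi}}{\hat{\sigma}_y^2}
```

Here \hat{\sigma}_y^2 can be taken either from the dispersion of an intercept-only model or from a direct variance estimate of the outcome.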

Using one of the examples from the svyglm help page:

> data(api)
> dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
> api.reg <- svyglm(api.stu~enroll, design=dstrat)
> summary(api.reg)

Call:
svyglm(formula = api.stu ~ enroll, design = dstrat)

Survey design:
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.34383   11.46399   1.164    0.246    
enroll       0.81454    0.02459  33.120   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 7331.633)

Number of Fisher Scoring iterations: 2

> nullmodel<-svyglm(api.stu~1,design=dstrat)
> summary(nullmodel)

Call:
svyglm(formula = api.stu ~ 1, design = dstrat)

Survey design:
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   498.23      16.06   31.02   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 137086.3)

Number of Fisher Scoring iterations: 2

> 1-7331.633/137086.3
[1] 0.9465181

You could also get the null-model variance using svyvar:

> svyvar(~api.stu,design=dstrat)
        variance    SE
api.stu   137086 19197
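The same arithmetic can be done without copying numbers from printed output. The following is only a sketch: the variable names `resid_var`, `total_var` and `r2` are made up for illustration, and it assumes that the `dispersion` component of `summary()` on a `svyglm` fit holds the value printed as the dispersion parameter.

```r
library(survey)
data(api)

# Same stratified design and model as in the transcript above
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)
fit <- svyglm(api.stu ~ enroll, design = dstrat)

# Residual variance: the dispersion reported by summary()
resid_var <- as.numeric(summary(fit)$dispersion)

# Total variance of the outcome, estimated under the same design
total_var <- as.numeric(svyvar(~api.stu, design = dstrat))

# Design-based R-squared
r2 <- 1 - resid_var / total_var
print(r2)
```

This should land close to the 0.9465 computed by hand above. If the outcome or the predictors contain missing values, the model and `svyvar` may end up using different sets of rows, so in that case it is safer to subset the design to complete cases first.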

In this case we have the whole population, so we can run lm on the population and compare the survey estimate of R-squared with the population value:

> summary(lm(api.stu ~ enroll,data=apipop))

Call:
lm(formula = api.stu ~ enroll, data = apipop)

Residuals:
     Min       1Q   Median       3Q      Max 
-1021.20   -13.76     6.13    29.56   498.98 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.613245   1.953709   6.968 3.55e-12 ***
enroll       0.813556   0.002522 322.581  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 92.16 on 6155 degrees of freedom
  (37 observations deleted due to missingness)
Multiple R-squared:  0.9442,    Adjusted R-squared:  0.9441 
F-statistic: 1.041e+05 on 1 and 6155 DF,  p-value: < 2.2e-16

As an added bonus: if you want the Nagelkerke or Cox-Snell pseudo-R-squared for binary or count data, the survey package has a function for that, psrsq.
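A minimal sketch of how psrsq might be called on a binary outcome. The model below (school-wide target met, regressed on `ell` and `meals` in the `apiclus2` example data) is chosen only for illustration and is not part of the original answer.

```r
library(survey)
data(api)

# Two-stage cluster sample shipped with the survey package
dclus2 <- svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)

# Survey-weighted logistic regression; quasibinomial avoids
# warnings about non-integer weighted counts
logit_fit <- svyglm(I(sch.wide == "Yes") ~ ell + meals,
                    design = dclus2, family = quasibinomial())

# Pseudo-R-squared under both supported definitions
r2_cs <- psrsq(logit_fit, method = "Cox-Snell")
r2_nk <- psrsq(logit_fit, method = "Nagelkerke")
print(c(CoxSnell = r2_cs, Nagelkerke = r2_nk))
```

The Nagelkerke value rescales the Cox-Snell value so that its maximum is 1, which is why it should never come out smaller than the Cox-Snell value.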

Related Solutions

Solved – Survey regression in R with singleton PSUs

You need to install the survey package. Here is an example of how to define the survey design you have specified and how to run a linear regression on these data. I assume that the dataset has already been loaded.

require(survey)
options(survey.lonely.psu = "adjust")
design1 <- svydesign(id = ~psuid, strata = ~stratvar, weights = ~weightvar, data = mydata)
model1 <- svyglm(y ~ x1 + x2, design = design1)
summary(model1)

IMHO, Thomas Lumley's homepage is an excellent starting point for this kind of thing.

Rather than only installing the survey package, you can install the Official Statistics task view:

install.packages("ctv")
# install.views() lives in the ctv package, so call it with the namespace prefix
ctv::install.views("OfficialStatistics")

This task view gives you a rather nice and complete toolbox to work with survey data.

Note that Stata's svyset command gives you basically the same options for handling singleton sampling units as you have in R.
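To see what the `survey.lonely.psu` option changes, here is a small sketch on made-up data. The data frame `toy` and its values are hypothetical and exist only to create a stratum with a single PSU; the column names follow the example above.

```r
library(survey)

# Hypothetical toy data: stratum 3 contains only one PSU
toy <- data.frame(
  stratvar  = c(1, 1, 2, 2, 3),
  psuid     = 1:5,
  weightvar = rep(10, 5),
  y         = c(2, 4, 6, 8, 10)
)
design1 <- svydesign(id = ~psuid, strata = ~stratvar,
                     weights = ~weightvar, data = toy)

# "fail" (the default): variance estimation is expected to stop
# with an error about the single-PSU stratum
old <- options(survey.lonely.psu = "fail")
res_fail <- try(svymean(~y, design1), silent = TRUE)

# "adjust": the lonely PSU is centered at the overall mean,
# which gives a conservative variance contribution
options(survey.lonely.psu = "adjust")
se_adjust <- as.numeric(SE(svymean(~y, design1)))

# "certainty": the lonely PSU contributes nothing to the variance
options(survey.lonely.psu = "certainty")
se_certainty <- as.numeric(SE(svymean(~y, design1)))

options(old)  # restore the previous setting
print(c(adjust = se_adjust, certainty = se_certainty))
```

The option is read when a variance is computed, not when the design object is created, so the same design can be reused under different settings.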

Solved – Using post-stratification weights in R survey package

When people say they have post-stratified weights, it does not necessarily mean they implemented post-stratification proper (that is, rescaled the weights in each demographic cell to the known population total). About 80% of the usage of "post-stratified weights" that I hear actually refers to calibrated weights (i.e., rather than adjusting each and every cell in a five-way table, the weights are adjusted only to match each of the five variables of the table individually). I produced what somebody referred to as a methodological rant on the distinction.

The distinction, however, plays a role in standard error calculations, as Anthony noted in another answer. With properly post-stratified weights, you can apply the regular variance estimation formulae, more or less treating your post-strata as sampling strata (minor technicalities aside). With weights that are only calibrated on each table margin, computations are somewhat more involved. Both procedures are implemented in the survey package, though. You just need to feed your post-stratification/calibration variables to the appropriate design object/formula.

library(survey)
data(api)
# cross-classified post-stratification variable in population
apipop$stype.sch.wide <-
  10*as.integer(apipop$stype) + as.integer(apipop$sch.wide)
# cross-classified post-stratification variable in sample
apiclus1$stype.sch.wide <- 
  10*as.integer(apiclus1$stype) + as.integer(apiclus1$sch.wide)
# population totals
(pop.totals <- xtabs(~stype.sch.wide, data=apipop))
# reference design
dclus1 <- svydesign(id=~dnum,weights=~pw,data=apiclus1,fpc=~fpc)
# post-stratification of the original design
dclus1p <- postStratify(dclus1,~stype.sch.wide, pop.totals)
# design with post-stratified weights, but no evidence of post-stratification
dclus1pfake <- svydesign(id=~dnum,weights=~weights(dclus1p),data=apiclus1,fpc=~fpc)
# starting from the design that only knows the final weights, add the post-stratification information back
dclus1pp <- postStratify(dclus1pfake,~stype.sch.wide, pop.totals)

# estimates and standard errors: starting point
svymean(~api00,dclus1)
# post-stratification reduces standard errors a bit
svymean(~api00,dclus1p)
# but here we are not aware of the survey being post-stratified
svymean(~api00,dclus1pfake)
# if we just add post-stratification variables to the design object
# that only had post-stratified weights, the result is the same
# as for post-stratified object based on the original weights
svymean(~api00,dclus1pp)
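For the other case discussed above, weights calibrated to each margin separately rather than to the full cross-classification, a sketch using rake() on the same data might look like the following. The object names `pop.stype`, `pop.schwide` and `dclus1r` are illustrative, and the choice of `stype` and `sch.wide` as margins simply mirrors the post-stratification example.

```r
library(survey)
data(api)

# Reference cluster design, as in the example above
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Population totals for each margin separately (no cross-classification)
pop.stype   <- xtabs(~stype, data = apipop)
pop.schwide <- xtabs(~sch.wide, data = apipop)

# Raking: iteratively post-stratify on each margin in turn
dclus1r <- rake(dclus1,
                sample.margins     = list(~stype, ~sch.wide),
                population.margins = list(pop.stype, pop.schwide))

# The raked design should reproduce the population margins
raked_schwide <- svytotal(~sch.wide, dclus1r)
print(raked_schwide)
print(pop.schwide)

# Standard errors now reflect the calibration
svymean(~api00, dclus1r)
```

Because rake() records the calibration in the design object, later calls such as svymean() account for it in their standard errors, which is the point made above about feeding the calibration variables to the design.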
