From what I can tell, we cannot run an ordinary least squares regression in R when using weighted data and the `survey` package. Here, we have to use `svyglm()`, which instead runs a generalized linear model (which may be the same thing? I am fuzzy on what is different).

`svyglm` will give you a linear model if you use `family = gaussian()`, which seems to be the default judging by the `survey` vignette (in version 3.32-1); see the example where they fit the `regmodel`.
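To make this concrete, here is a minimal sketch of what the `svyglm()` call might look like. The data frame and weights column below are made up for illustration; the point estimates should match those from a weighted `lm`, though the standard errors will differ because `svyglm` uses a design-based variance estimator.

```r
# Sketch: fit the same weighted linear model with lm() and svyglm().
# The data are made up; the weights are treated as sampling weights.
set.seed(1)
org <- data.frame(x = (-4):5, y = rnorm(10), weights = 1:10)

wls <- lm(y ~ x, data = org, weights = weights)

if (requireNamespace("survey", quietly = TRUE)) {
  library(survey)
  des <- svydesign(ids = ~1, weights = ~weights, data = org)
  regmodel <- svyglm(y ~ x, design = des, family = gaussian())
  # Point estimates agree with the weighted lm fit
  print(cbind(lm = coef(wls), svyglm = coef(regmodel)))
}
```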
It seems that the package just makes sure that you use the correct weights when it calls `glm`. Thus, if your outcome is continuous and you assume it is iid normally distributed, then you should use `family = gaussian()`. The result is a weighted linear model. This answer addresses the question

> Why can we not run OLS in the `survey` package, while it seems that this is possible to do with weighted data in Stata?

by stating that you indeed can do that with the `survey` package. As for the question

> What is the difference in interpretation between the deviance of a generalized linear model and an r-squared value?

there is a straightforward formula to get the $R^2$ with `family = gaussian()`, as some people have mentioned in the comments. Adding weights does not change anything either, as I show below.
> set.seed(42293888)
> x <- (-4):5
> y <- 2 + x + rnorm(length(x))
> org <- data.frame(x = x, y = y, weights = 1:10)
>
> # show data and fit model. Notice the R-squared
> head(org)
x y weights
1 -4 0.4963671 1
2 -3 -0.5675720 2
3 -2 -0.3615302 3
4 -1 0.7091697 4
5 0 0.6485203 5
6 1 3.8495979 6
> summary(lm(y ~ x, org, weights = weights))
Call:
lm(formula = y ~ x, data = org, weights = weights)
Weighted Residuals:
Min 1Q Median 3Q Max
-3.1693 -0.4463 0.2017 0.9100 2.9667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.7368 0.3514 4.942 0.00113 **
x 0.9016 0.1111 8.113 3.95e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.019 on 8 degrees of freedom
Multiple R-squared: 0.8916, Adjusted R-squared: 0.8781
F-statistic: 65.83 on 1 and 8 DF, p-value: 3.946e-05
>
> # make redundant data set with redundant rows
> idx <- unlist(mapply(rep, x = 1:nrow(org), times = org$weights))
> org_redundant <- org[idx, ]
> head(org_redundant)
x y weights
1 -4 0.4963671 1
2 -3 -0.5675720 2
2.1 -3 -0.5675720 2
3 -2 -0.3615302 3
3.1 -2 -0.3615302 3
3.2 -2 -0.3615302 3
>
> # fit model and notice the same R-squared
> summary(lm(y ~ x, org_redundant))
Call:
lm(formula = y ~ x, data = org_redundant)
Residuals:
Min 1Q Median 3Q Max
-1.19789 -0.29506 -0.05435 0.33131 2.36610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.73680 0.13653 12.72 <2e-16 ***
x 0.90163 0.04318 20.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7843 on 53 degrees of freedom
Multiple R-squared: 0.8916, Adjusted R-squared: 0.8896
F-statistic: 436.1 on 1 and 53 DF, p-value: < 2.2e-16
>
> # glm gives you the same with family = gaussian()
> # just compute the R^2 from the deviances. See
> # https://stats.stackexchange.com/a/46358/81865
> fit <- glm(y ~ x, family = gaussian(), org_redundant)
> fit$coefficients
(Intercept) x
1.7368017 0.9016347
> 1 - fit$deviance / fit$null.deviance
[1] 0.8916387
The deviance is just the sum of squared errors when you use `family = gaussian()`.
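A quick base-R check of this claim (toy data; any continuous outcome works):

```r
# With family = gaussian(), glm's deviance equals the residual sum of squares,
# since the deviance residuals are then just the response residuals y - mu.
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)
fit <- glm(y ~ x, family = gaussian())
all.equal(deviance(fit), sum(residuals(fit)^2))  # TRUE
```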
Caveats
I assume from your question that you want a linear model. Further, I have never used the `survey` package; I only scanned through it quickly and made assumptions about what it does, which I state in my answer.
I prefer to call GEE an estimation method, comparable to ML or REML, since it combines quasi-likelihood estimation with robust variance estimation to estimate generalized linear marginal models for longitudinal data. Some texts and papers also speak of "GEE models", e.g. Hedeker, D., & Gibbons, R. D. (2006). Longitudinal Data Analysis. Wiley-Interscience. I guess that is to separate it from subject-specific (fixed- and random-effects) models, since GEE is mainly used for marginal (population-average) models.
We have no idea about the distribution function of the outcome, but we know its mean ($\mu$) and variance ($V$). So we cannot do ML, but we can turn to the quasi-likelihood,
$$Q(\mu, y) = \int_y^{\mu} (y - t)^{\top} V^{-1} \, dt,$$
and the quasi-likelihood estimating equations (the quasi-score function) are
$$\sum_i \frac{\partial \mu_i^{\top}}{\partial \beta} V_i^{-1} (y_i - \mu_i) = 0.$$
Thus the estimating equations are derived without specifying the joint distribution of the outcomes; they reduce to the ordinary score equations when the marginal distributions belong to an exponential family. The approach based on maximum likelihood (ML) estimation specifies the joint multivariate normal distribution of the outcome variables, while GEE, based on the quasi-likelihood, specifies only the marginal distributions.
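As a sanity check, in the Gaussian special case with identity link, $\mu_i = x_i^{\top}\beta$ and $V_i = \sigma^2$, the quasi-score reduces to the familiar least-squares normal equations:

```latex
\sum_i \frac{\partial \mu_i^{\top}}{\partial \beta} V_i^{-1} (y_i - \mu_i)
  = \frac{1}{\sigma^2} \sum_i x_i \left( y_i - x_i^{\top} \beta \right) = 0
  \quad\Longleftrightarrow\quad
  X^{\top} (y - X \beta) = 0 .
```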
I have seen GEE applied in statistical genetics, but I am afraid that work, too, falls under the framework of generalized linear models.
Best Answer
Your question: yes, they can be. Technically, you can use a normal R-squared measurement as your goodness-of-fit measure. It might not be a very good fit measure, but you can certainly use it. Further, you have to ask yourself whether the increase in precision is worth the loss of readability of your findings. For example, moving from R-squared to an adjusted R-squared is likely to be a meaningful increase in precision at the sacrifice of readability. I personally like McKelvey & Zavoina's measure and other similar approaches (e.g. Xu's R-squared for mixed models). That does not mean they are the best or only approaches.
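For illustration, a minimal base-R sketch of McKelvey & Zavoina's pseudo-$R^2$ for a probit model. The data are simulated; for a logit model the latent residual variance would be $\pi^2/3$ instead of 1.

```r
# McKelvey & Zavoina pseudo-R^2:
#   var(fitted latent index) / (var(fitted latent index) + latent residual variance).
# For probit the latent residual variance is 1; for logit it is pi^2 / 3.
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(0.5 + 1.2 * x))

fit <- glm(y ~ x, family = binomial(link = "probit"))
eta <- predict(fit, type = "link")  # fitted latent index x'beta
mz_r2 <- var(eta) / (var(eta) + 1)
mz_r2
```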