From what I can tell, we cannot run an ordinary least squares regression in R when using weighted data and the `survey` package. Here, we have to use `svyglm()`, which instead fits a generalized linear model (which may amount to the same thing; I am fuzzy on exactly what is different).

`svyglm` will give you a linear model if you use `family = gaussian()`, which appears to be the default judging from the `survey` vignette (in version 3.32-1); see the example where they fit the `regmodel`. It seems that the package simply makes sure the correct weights are used when it calls `glm`. Thus, if your outcome is continuous and you assume it is iid normally distributed, then you should use `family = gaussian()`. The result is a weighted linear model. This answers the question

"Why can we not run OLS in the `survey` package, while it seems that this is possible to do with weighted data in Stata?"

by stating that you indeed can do that with the `survey` package.
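Here is a minimal sketch of what I mean. The toy data are my own, and I am assuming the standard `svydesign()`/`svyglm()` interface from the package documentation; the point estimates from `svyglm()` with `family = gaussian()` should match a weighted `lm()` fit, though the standard errors generally differ.

```r
# Sketch: fit a weighted linear model via svyglm() and compare the
# point estimates with a weighted lm() fit. Requires the survey package.
library(survey)

set.seed(1)
dat <- data.frame(x = (-4):5)
dat$y <- 2 + dat$x + rnorm(nrow(dat))
dat$w <- 1:10

# Declare the design: no clusters (ids = ~1), sampling weights in w
des <- svydesign(ids = ~1, weights = ~w, data = dat)

# family = gaussian() gives a linear model
fit_svy <- svyglm(y ~ x, design = des, family = gaussian())
fit_lm  <- lm(y ~ x, data = dat, weights = w)

# The point estimates agree; the standard errors do not in general,
# because svyglm() uses design-based (sandwich-type) variances
coef(fit_svy)
coef(fit_lm)
```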
As for the follow-up question

"What is the difference in interpretation between the deviance of a generalized linear model and an r-squared value?"

there is a straightforward formula to get the $R^2$ with `family = gaussian()`, as some people have mentioned in the comments. Adding weights does not change anything either, as I show below.
> set.seed(42293888)
> x <- (-4):5
> y <- 2 + x + rnorm(length(x))
> org <- data.frame(x = x, y = y, weights = 1:10)
>
> # show data and fit model. Notice the R-squared
> head(org)
x y weights
1 -4 0.4963671 1
2 -3 -0.5675720 2
3 -2 -0.3615302 3
4 -1 0.7091697 4
5 0 0.6485203 5
6 1 3.8495979 6
> summary(lm(y ~ x, org, weights = weights))
Call:
lm(formula = y ~ x, data = org, weights = weights)
Weighted Residuals:
Min 1Q Median 3Q Max
-3.1693 -0.4463 0.2017 0.9100 2.9667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.7368 0.3514 4.942 0.00113 **
x 0.9016 0.1111 8.113 3.95e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.019 on 8 degrees of freedom
Multiple R-squared: 0.8916, Adjusted R-squared: 0.8781
F-statistic: 65.83 on 1 and 8 DF, p-value: 3.946e-05
>
> # make redundant data set with redundant rows
> idx <- unlist(mapply(rep, x = 1:nrow(org), times = org$weights))
> org_redundant <- org[idx, ]
> head(org_redundant)
x y weights
1 -4 0.4963671 1
2 -3 -0.5675720 2
2.1 -3 -0.5675720 2
3 -2 -0.3615302 3
3.1 -2 -0.3615302 3
3.2 -2 -0.3615302 3
>
> # fit model and notice the same R-squared
> summary(lm(y ~ x, org_redundant))
Call:
lm(formula = y ~ x, data = org_redundant)
Residuals:
Min 1Q Median 3Q Max
-1.19789 -0.29506 -0.05435 0.33131 2.36610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.73680 0.13653 12.72 <2e-16 ***
x 0.90163 0.04318 20.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7843 on 53 degrees of freedom
Multiple R-squared: 0.8916, Adjusted R-squared: 0.8896
F-statistic: 436.1 on 1 and 53 DF, p-value: < 2.2e-16
>
> # glm gives you the same with family = gaussian()
> # just compute the R^2 from the deviances. See
> # https://stats.stackexchange.com/a/46358/81865
> fit <- glm(y ~ x, family = gaussian(), org_redundant)
> fit$coefficients
(Intercept) x
1.7368017 0.9016347
> 1 - fit$deviance / fit$null.deviance
[1] 0.8916387
The deviance is just the sum of squared errors when you use `family = gaussian()`.
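To make that claim concrete, here is a small check (with made-up toy data) that `fit$deviance` equals the residual sum of squares for a gaussian `glm`, and likewise `fit$null.deviance` equals the total sum of squares:

```r
# For family = gaussian(), the deviance is the residual sum of squares
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)
fit <- glm(y ~ x, family = gaussian())

# Deviance vs. sum of squared residuals
all.equal(fit$deviance, sum(residuals(fit)^2))

# Null deviance vs. total sum of squares about the mean
all.equal(fit$null.deviance, sum((y - mean(y))^2))
```

Both comparisons return `TRUE`, which is why `1 - fit$deviance / fit$null.deviance` reproduces the usual $R^2$.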
Caveats
I assume from your question that you want a linear model. Further, I have never used the `survey` package myself; I only scanned through its documentation quickly, and I state the assumptions I make about what it does in my answer.
Well, `ca.jo` will give you the parameters of the long-run cointegrating relationship(s), but not the error-correction/lag parameters, which are obtained with `cajools`. Note that, as an alternative and slightly more direct route, you can also use the function `VECM` from the package "tsDyn".

The reason for calling these functions "OLS" is that, once your ECT(s) have been obtained by maximum likelihood, the ML estimator for the other parameters is equivalent to an OLS regression, conditional on the ML parameters of the cointegrating relationship.
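A rough sketch of the workflow, assuming the `urca` and `tsDyn` packages; the simulated cointegrated pair and the lag/rank choices (`K = 2`, `lag = 1`, `r = 1`) are purely illustrative:

```r
# Sketch: long-run parameters from ca.jo(), error-correction/lag
# parameters via cajools(), and the more direct VECM() from tsDyn.
library(urca)   # ca.jo(), cajools()
library(tsDyn)  # VECM()

set.seed(1)
n <- 200
common <- cumsum(rnorm(n))             # shared stochastic trend
y1 <- common + rnorm(n, sd = 0.5)      # the two series share the trend,
y2 <- 0.5 * common + rnorm(n, sd = 0.5)  # so they are cointegrated
dat <- cbind(y1, y2)

# Johansen procedure: the cointegrating vector(s) sit in cjo@V
cjo <- ca.jo(dat, type = "eigen", ecdet = "const", K = 2)
cjo@V

# cajools() gives the OLS regressions of the VECM, i.e. the
# error-correction and lag coefficients
cajools(cjo)

# Alternative: VECM() from tsDyn estimates everything in one call
fit_vecm <- VECM(dat, lag = 1, r = 1, estim = "ML")
summary(fit_vecm)
```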
Best Answer
Setting aside the "estimator" keyword: least squares and ordinary least squares, referred to as LS and OLS respectively, are not the same. LS is much more general: it consists of linear and non-linear LS, and linear LS in turn consists of OLS and some other types (e.g. GLS: generalized LS; WLS: weighted LS). The non-linear part is itself a different world.

In LS methods, the key thing is minimizing the sum of squared residuals, so any of the methods listed above with this objective function is an LS estimator. "LS estimator" is therefore not very definitive on its own. Once you make assumptions about the model and the error, it boils down to one of the listed alternatives. But sometimes (and perhaps commonly) it is used to refer to OLS. An author who writes just "LS estimator" is probably referring to some special case of it, having made the relevant assumptions somewhere in the article/notes. In your link, page 9, section 4.4 is referring to OLS, which the author in fact makes explicit at the beginning of the second paragraph.
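To illustrate that "minimize the sum of squared residuals" is the common thread, here is a sketch (toy data of my own; `optim()` is just one convenient numerical minimizer) where the same objective, with and without weights, recovers the OLS and WLS fits respectively:

```r
# LS estimators minimize a sum of squared residuals; OLS and WLS differ
# only in how the squared residuals are weighted in that sum.
set.seed(1)
x <- 1:30
y <- 1 + 0.5 * x + rnorm(30)
w <- runif(30, 0.5, 2)

# (Weighted) sum of squared residuals for intercept b[1], slope b[2]
ssr <- function(b, weights = rep(1, length(y))) {
  sum(weights * (y - b[1] - b[2] * x)^2)
}

ols_num <- optim(c(0, 0), ssr)$par               # unweighted objective
wls_num <- optim(c(0, 0), ssr, weights = w)$par  # weighted objective

# Both agree (up to optimizer tolerance) with the closed-form lm() fits
rbind(numeric = ols_num, closed_form = coef(lm(y ~ x)))
rbind(numeric = wls_num, closed_form = coef(lm(y ~ x, weights = w)))
```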