What kinds of plots are normally used with generalized linear models, and how are they interpreted?
In particular, what can we see from a plot of standardized deviance residuals vs. fitted values?
Although it may appear that the mean of the log-transformed variables is preferable (since this is how log-normal is typically parameterised), from a practical point of view, the log of the mean is typically much more useful.
This is particularly true when your model is not exactly correct, and to quote George Box: "All models are wrong, but some are useful."
Suppose some quantity is log-normally distributed, blood pressure say (I'm not a medic!), and we have two populations, men and women. One might hypothesise that the average blood pressure is higher in women than in men. This corresponds exactly to asking whether the log of the average blood pressure is higher in women than in men. It is not the same as asking whether the average of the log blood pressure is higher in women than in men.
Don't get confused by the textbook parameterisation of a distribution: it doesn't have any "real" meaning. The log-normal distribution is parameterised by the mean of the log ($\mu_{\ln}$) for mathematical convenience, but we could equally choose to parameterise it by its actual mean and variance:
$\mu = e^{\mu_{\ln} + \sigma_{\ln}^2/2}$
$\sigma^2 = (e^{\sigma^2_{\ln}} -1)e^{2 \mu_{\ln} + \sigma_{\ln}^2}$
Obviously, doing so makes the algebra horribly complicated, but it still works and means the same thing.
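To make the two parameterisations concrete, here is a minimal R sketch of the conversion in both directions. The function names `lnorm_to_moments` and `moments_to_lnorm` are hypothetical, and the inverse formulas ($\sigma^2_{\ln} = \ln(1 + \sigma^2/\mu^2)$, $\mu_{\ln} = \ln\mu - \sigma^2_{\ln}/2$) are obtained by solving the two equations above; the numeric values are arbitrary.

```r
# Hypothetical helpers: convert between the (mu_ln, sigma_ln) parameterisation
# of a log-normal distribution and its actual mean and variance.
lnorm_to_moments <- function(mu_ln, sigma_ln) {
  mu     <- exp(mu_ln + sigma_ln^2 / 2)
  sigma2 <- (exp(sigma_ln^2) - 1) * exp(2 * mu_ln + sigma_ln^2)
  c(mean = mu, var = sigma2)
}

# Inverse mapping, solved from the two equations above.
moments_to_lnorm <- function(mu, sigma2) {
  sigma2_ln <- log(1 + sigma2 / mu^2)
  mu_ln     <- log(mu) - sigma2_ln / 2
  c(mu_ln = mu_ln, sigma_ln = sqrt(sigma2_ln))
}

# Round trip with arbitrary illustrative values.
m    <- lnorm_to_moments(mu_ln = 4.8, sigma_ln = 0.3)
back <- moments_to_lnorm(m[["mean"]], m[["var"]])
print(m)
print(back)
```

The round trip should recover the original `mu_ln` and `sigma_ln` up to floating-point error, which is one way to see that both parameterisations describe the same distribution.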
Looking at the formulas above, we can see an important difference between transforming the variables and transforming the mean. The log of the mean, $\ln(\mu) = \mu_{\ln} + \sigma_{\ln}^2/2$, increases as $\sigma^2_{\ln}$ increases, while the mean of the log, $\mu_{\ln}$, doesn't.
This means that women could, on average, have higher blood pressure than men, even though the mean parameter of the log-normal distribution ($\mu_{\ln}$) is the same, simply because the variance parameter is larger. This fact would be missed by a test that used log(Blood Pressure).
So far, we have assumed that blood pressure genuinely is log-normal. If the true distributions are not quite log-normal, then transforming the data will (typically) make things even worse than above, since we won't quite know what our "mean" parameter actually means. That is, we won't know whether the two equations for the mean and variance given above are correct. Using them to transform back and forth will then introduce additional errors.
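A small simulation can illustrate this point. This is only a sketch: the values of `mu_ln` and `sigma_ln` are made up, not real blood pressure figures. Both groups share the same $\mu_{\ln}$ but have different $\sigma_{\ln}$.

```r
set.seed(1)

# Made-up parameters: same mean of the log, different spread on the log scale.
mu_ln    <- 4.8
sigma_ln <- c(men = 0.1, women = 0.3)

# Theoretical log of the mean: ln(mu) = mu_ln + sigma_ln^2 / 2
log_of_mean <- mu_ln + sigma_ln^2 / 2

# Simulate both groups and compute the two summaries from the samples.
n   <- 1e5
sim <- lapply(sigma_ln, function(s) rlnorm(n, meanlog = mu_ln, sdlog = s))
sim_mean_of_log <- sapply(sim, function(bp) mean(log(bp)))
sim_log_of_mean <- sapply(sim, function(bp) log(mean(bp)))

print(rbind(log_of_mean, sim_log_of_mean, sim_mean_of_log))
```

Under these assumptions the mean of the log is essentially the same in both groups, while the log of the mean (and hence the mean itself) is larger in the group with the larger $\sigma_{\ln}$.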
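One way to compare the two approaches in a fitted model is sketched below, on simulated data with made-up parameters. `lm(log(bp) ~ group)` compares the mean of the log, whereas a GLM with a log link models the log of the mean directly and needs no back-transformation. The choice of `Gamma(link = "log")` is an assumption made for this illustration (any family with a log link would target the same quantity); it is not something the argument above prescribes.

```r
set.seed(2)

# Simulated data: same meanlog in both groups, larger sdlog for women.
n <- 5000
d <- data.frame(group = factor(rep(c("men", "women"), each = n),
                               levels = c("men", "women")))
d$bp <- rlnorm(2 * n, meanlog = 4.8,
               sdlog = ifelse(d$group == "women", 0.5, 0.2))

# Transforming the variable: the group effect is a difference in mean(log(bp)).
fit_mean_of_log <- lm(log(bp) ~ group, data = d)

# Transforming the mean: the group effect is a difference in log(mean(bp)).
fit_log_of_mean <- glm(bp ~ group, family = Gamma(link = "log"), data = d)

effect_mean_of_log <- coef(fit_mean_of_log)[["groupwomen"]]
effect_log_of_mean <- coef(fit_log_of_mean)[["groupwomen"]]
print(c(mean_of_log = effect_mean_of_log, log_of_mean = effect_log_of_mean))
```

With these settings the first effect should be close to zero and the second clearly positive, so the two models answer different questions about the same data.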
From what I can tell, we cannot run an ordinary least squares regression in R when using weighted data and the survey package. Here, we have to use svyglm(), which instead runs a generalized linear model (which may be the same thing? I am fuzzy here in terms of what is different).
svyglm will give you a linear model if you use family = gaussian(), which seems to be the default according to the survey vignette (in version 3.32-1). See the example where they find the regmodel.
It seems that the package just makes sure that you use the correct weights when it calls glm. Thus, if your outcome is continuous and you assume that it is iid normally distributed, then you should use family = gaussian(). The result is a weighted linear model. This answers the question
Why can we not run OLS in the survey package, while it seems that this is possible to do with weighted data in Stata?
by stating that you indeed can do that with the survey package. As for the following question
What is the difference in interpretation between the deviance of a generalized linear model and an r-squared value?
There is a straightforward formula to get the $R^2$ with family = gaussian(), as some people have mentioned in the comments. Adding weights does not change anything either, as I show below:
> set.seed(42293888)
> x <- (-4):5
> y <- 2 + x + rnorm(length(x))
> org <- data.frame(x = x, y = y, weights = 1:10)
>
> # show data and fit model. Notice the R-squared
> head(org)
   x          y weights
1 -4  0.4963671       1
2 -3 -0.5675720       2
3 -2 -0.3615302       3
4 -1  0.7091697       4
5  0  0.6485203       5
6  1  3.8495979       6
> summary(lm(y ~ x, org, weights = weights))
Call:
lm(formula = y ~ x, data = org, weights = weights)
Weighted Residuals:
    Min      1Q  Median      3Q     Max
-3.1693 -0.4463  0.2017  0.9100  2.9667
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.7368     0.3514   4.942  0.00113 **
x             0.9016     0.1111   8.113 3.95e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.019 on 8 degrees of freedom
Multiple R-squared:  0.8916, Adjusted R-squared:  0.8781
F-statistic: 65.83 on 1 and 8 DF,  p-value: 3.946e-05
>
> # make redundant data set with redundant rows
> idx <- unlist(mapply(rep, x = 1:nrow(org), times = org$weights))
> org_redundant <- org[idx, ]
> head(org_redundant)
     x          y weights
1   -4  0.4963671       1
2   -3 -0.5675720       2
2.1 -3 -0.5675720       2
3   -2 -0.3615302       3
3.1 -2 -0.3615302       3
3.2 -2 -0.3615302       3
>
> # fit model and notice the same R-squared
> summary(lm(y ~ x, org_redundant))
Call:
lm(formula = y ~ x, data = org_redundant)
Residuals:
     Min       1Q   Median       3Q      Max
-1.19789 -0.29506 -0.05435  0.33131  2.36610
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.73680    0.13653   12.72   <2e-16 ***
x            0.90163    0.04318   20.88   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7843 on 53 degrees of freedom
Multiple R-squared:  0.8916, Adjusted R-squared:  0.8896
F-statistic: 436.1 on 1 and 53 DF,  p-value: < 2.2e-16
>
> # glm gives you the same with family = gaussian()
> # just compute the R^2 from the deviances. See
> # https://stats.stackexchange.com/a/46358/81865
> fit <- glm(y ~ x, family = gaussian(), org_redundant)
> fit$coefficients
(Intercept)           x
  1.7368017   0.9016347
> 1 - fit$deviance / fit$null.deviance
[1] 0.8916387
The deviance is just the sum of squared errors when you use family = gaussian().
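The same identity can be checked directly with weights passed to glm rather than by duplicating rows. This is a minimal sketch on a small simulated data set; the names `toy`, `w`, `wrss` and `r2_dev` are illustrative and not part of the session above.

```r
set.seed(1)

# Small illustrative data set with integer weights (hypothetical names).
x   <- (-4):5
toy <- data.frame(x = x, y = 2 + x + rnorm(length(x)), w = 1:10)

fit_glm <- glm(y ~ x, family = gaussian(), data = toy, weights = w)
fit_lm  <- lm(y ~ x, data = toy, weights = w)

# With a Gaussian family the deviance is the weighted residual sum of squares.
wrss <- sum(toy$w * residuals(fit_glm, type = "response")^2)

# R-squared computed from the deviances, compared with the one lm() reports.
r2_dev <- 1 - fit_glm$deviance / fit_glm$null.deviance

print(c(deviance = deviance(fit_glm), weighted_rss = wrss))
print(c(r2_from_deviance = r2_dev, r2_from_lm = summary(fit_lm)$r.squared))
```

Both pairs of numbers should agree up to floating-point error.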
I assume from your question that you want a linear model. Further, I have never used the survey package, but I quickly scanned through it and made assumptions about what it does, which I state in my answer.
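For reference, a call along the following lines is how a weighted linear model would typically be fitted with the survey package. Treat it as an unverified sketch: it assumes a design with no clustering (`ids = ~1`) and sampling weights stored in a column called `weights`, and the object names `toy`, `des` and `fit_svy` are made up. The block only runs the survey part if the package is installed.

```r
set.seed(1)

# Illustrative data with a weights column (hypothetical names throughout).
x   <- (-4):5
toy <- data.frame(x = x, y = 2 + x + rnorm(length(x)), weights = 1:10)

# Ordinary weighted least squares for comparison.
fit_lm <- lm(y ~ x, data = toy, weights = weights)

if (requireNamespace("survey", quietly = TRUE)) {
  # Assumed design: no clusters, weights taken from the `weights` column.
  des <- survey::svydesign(ids = ~1, weights = ~weights, data = toy)

  # family is left at its default, which is gaussian(), i.e. a linear model.
  fit_svy <- survey::svyglm(y ~ x, design = des)

  print(rbind(lm = coef(fit_lm), svyglm = coef(fit_svy)))
}
```

The point estimates should match those of the weighted lm fit; the standard errors generally will not, because svyglm uses design-based variance estimation.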
Best Answer
From StatSoft, without links:
If you are using R:
Quoted from Germán Rodríguez:
Depending on your type of study, there might be corrections to apply.
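As a minimal sketch of the plot asked about in the question (this is not the material quoted in the answer, and the Poisson model and simulated data are assumptions chosen only for illustration), standardized deviance residuals can be plotted against the fitted linear predictor like this:

```r
set.seed(3)

# Simulated count data from a Poisson model with a log link (illustrative only).
n <- 200
x <- runif(n)
y <- rpois(n, lambda = exp(0.5 + 1.2 * x))

fit <- glm(y ~ x, family = poisson())

# Standardized deviance residuals and fitted values on the link scale.
std_dev_res <- rstandard(fit, type = "deviance")
eta_hat     <- predict(fit, type = "link")

plot(eta_hat, std_dev_res,
     xlab = "Fitted values (linear predictor)",
     ylab = "Standardized deviance residuals")
abline(h = 0, lty = 2)                              # reference line at zero
lines(lowess(eta_hat, std_dev_res), col = "red")    # smooth trend through the residuals
```

If the model is adequate, the points should form a roughly horizontal band around zero with approximately constant spread. A visible trend in the smooth suggests a missing term or an inappropriate link, and a funnel shape suggests that the assumed variance function does not fit.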