Solved – Visualising generalized linear model

data visualization, generalized linear model

What kinds of plots are normally used with generalized linear models, and how are they interpreted?

In particular, what can we see from a plot of standardized deviance residuals against fitted values?

Best Answer

From StatSoft, without links:

Diagnostics in the generalized linear model. The two basic types of residuals are the so-called Pearson residuals and deviance residuals. Pearson residuals are based on the difference between observed responses and the predicted values; deviance residuals are based on the contribution of the observed responses to the log-likelihood statistic. In addition, leverage scores, studentized residuals, generalized Cook's D, and other observational statistics (statistics based on individual observations) can be computed. For a description and discussion of these statistics, see Hosmer and Lemeshow (1989).
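As a minimal sketch of the two residual types described above, the following computes Pearson and deviance residuals by hand and compares them with what residuals() returns. The Poisson family and the counts are assumptions made up for illustration; they are not from the original answer.

```r
# Hypothetical count data, chosen only for illustration
x <- 1:10
y <- c(2, 3, 6, 7, 8, 9, 10, 12, 15, 20)
fit <- glm(y ~ x, family = poisson)
mu <- unname(fitted(fit))

# Pearson residual: (observed - fitted) / sqrt(V(mu)); V(mu) = mu for Poisson
pearson_manual <- (y - mu) / sqrt(mu)

# Deviance residual: signed square root of each observation's
# contribution to the deviance (this form needs all y > 0)
dev_contrib <- 2 * (y * log(y / mu) - (y - mu))
deviance_manual <- sign(y - mu) * sqrt(pmax(dev_contrib, 0))

# Both agree with the built-in extractors
stopifnot(isTRUE(all.equal(unname(residuals(fit, type = "pearson")), pearson_manual)))
stopifnot(isTRUE(all.equal(unname(residuals(fit, type = "deviance")), deviance_manual)))

# The squared deviance residuals sum to the model deviance
stopifnot(isTRUE(all.equal(sum(deviance_manual^2), deviance(fit))))
```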

If you are using R:

lrfit <- glm(cbind(using, notUsing) ~ age * noMore + hiEduc, family = binomial)
summary(lrfit)
plot(lrfit)

Quoted from Germán Rodríguez:

R follows the popular custom of flagging significant coefficients with one, two or three stars depending on their p-values. Try plot(lrfit). You get the same plots as in a linear model, but adapted to a generalized linear model; for example the residuals plotted are deviance residuals (the square root of the contribution of an observation to the deviance, with the same sign as the raw residual).
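For the standardized deviance residual plot asked about in the question, here is a rough sketch. It assumes a Poisson model fitted to made-up counts (not the contraceptive-use data above) and uses rstandard(), which for a glm fit divides the deviance residuals by sqrt(dispersion * (1 - leverage)).

```r
# Hypothetical count data, chosen only for illustration
x <- 1:10
y <- c(2, 3, 6, 7, 8, 9, 10, 12, 15, 20)
fit <- glm(y ~ x, family = poisson)

# Standardized deviance residuals (type = "deviance" is the default)
r_std <- rstandard(fit)

# The same quantity computed from its definition
h <- hatvalues(fit)              # leverages
phi <- summary(fit)$dispersion   # fixed at 1 for the Poisson family
r_manual <- residuals(fit, type = "deviance") / sqrt(phi * (1 - h))
stopifnot(isTRUE(all.equal(r_std, r_manual, check.attributes = FALSE)))

# Residuals against the linear predictor, with a reference line at zero
plot(predict(fit), r_std,
     xlab = "Linear predictor",
     ylab = "Standardized deviance residual")
abline(h = 0, lty = 2)
```

If the model is adequate, the points should scatter around zero with no trend and roughly constant spread. Curvature hints at a wrong link function or a missing term, a funnel shape hints at a wrong variance function, and points with large absolute values (beyond about 2) are candidates for a closer look. With binary responses the points fall into two bands, so the plot is harder to read.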

The functions that can be used to extract results from the fit include

residuals or resid, for the deviance residuals

fitted or fitted.values, for the fitted values (estimated probabilities)

predict, for the linear predictor (estimated logits)

coef or coefficients, for the coefficients, and

deviance, for the deviance.

Some of these functions have optional arguments; for example, you can extract five different types of residuals, called "deviance", "pearson", "response" (response - fitted value), "working" (the working dependent variable in the IRLS algorithm - linear predictor), and "partial" (a matrix of working residuals formed by omitting each term in the model). You specify the one you want using the type argument, for example residuals(lrfit,type="pearson").
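A short sketch of how these extractors relate to one another, assuming a small grouped binomial data set that is made up for illustration (the variable names dose, successes and failures are hypothetical):

```r
# Hypothetical grouped binomial data: 20 trials per dose level
dose <- c(1, 2, 3, 4, 5)
successes <- c(2, 5, 9, 14, 18)
failures <- 20 - successes
fit <- glm(cbind(successes, failures) ~ dose, family = binomial)

eta <- predict(fit)    # linear predictor (estimated logits)
p_hat <- fitted(fit)   # estimated probabilities

# fitted values are the inverse logit of the linear predictor
stopifnot(isTRUE(all.equal(plogis(eta), p_hat, check.attributes = FALSE)))

# residuals() defaults to deviance residuals; their squares sum to the deviance
stopifnot(isTRUE(all.equal(sum(residuals(fit)^2), deviance(fit))))

# "response" residuals are observed proportion minus fitted probability
obs_prop <- successes / 20
stopifnot(isTRUE(all.equal(residuals(fit, type = "response"), obs_prop - p_hat,
                           check.attributes = FALSE)))

# "working" residuals rescale the response residuals by d(mu)/d(eta)
stopifnot(isTRUE(all.equal(residuals(fit, type = "working"),
                           (obs_prop - p_hat) / (p_hat * (1 - p_hat)),
                           check.attributes = FALSE)))

# Side-by-side comparison of four residual types
res_types <- c("deviance", "pearson", "response", "working")
res_table <- sapply(res_types, function(t) residuals(fit, type = t))
print(round(res_table, 3))
```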

Depending on your type of study, there might be corrections to apply.

Related Solutions

Log-Transformed Response – Linear Model vs. Generalized Linear Model with Log Link

Although it may appear that the mean of the log-transformed variables is preferable (since this is how log-normal is typically parameterised), from a practical point of view, the log of the mean is typically much more useful.

This is particularly true when your model is not exactly correct, and to quote George Box: "All models are wrong, some are useful."

Suppose some quantity is log-normally distributed, blood pressure say (I'm not a medic!), and we have two populations, men and women. One might hypothesise that the average blood pressure is higher in women than in men. Because the log is monotone, this corresponds exactly to asking whether the log of the average blood pressure is higher in women than in men. It is not the same as asking whether the average of the log blood pressure is higher in women than in men.

Don't get confused by the textbook parameterisation of a distribution; it doesn't have any "real" meaning. The log-normal distribution is parameterised by the mean of the log ($\mu_{\ln}$) for mathematical convenience, but we could equally choose to parameterise it by its actual mean and variance:

$\mu = e^{\mu_{\ln} + \sigma_{\ln}^2/2}$

$\sigma^2 = (e^{\sigma^2_{\ln}} -1)e^{2 \mu_{\ln} + \sigma_{\ln}^2}$

Obviously, doing so makes the algebra horribly complicated, but it still works and means the same thing.

Looking at the above formulas, we can see an important difference between transforming the variables and transforming the mean. The log of the mean, $\ln(\mu) = \mu_{\ln} + \sigma_{\ln}^2/2$, increases as $\sigma^2_{\ln}$ increases, while the mean of the log, $\mu_{\ln}$, doesn't.
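As a quick numerical check of the two moment formulas, the sketch below integrates against the log-normal density and compares the results with the closed forms. The parameter values are arbitrary assumptions.

```r
# Arbitrary log-scale parameters, chosen only for illustration
mu_ln <- 1.0
sigma_ln <- 0.5

# Closed-form mean and variance of the log-normal distribution
mean_formula <- exp(mu_ln + sigma_ln^2 / 2)
var_formula <- (exp(sigma_ln^2) - 1) * exp(2 * mu_ln + sigma_ln^2)

# The same moments by numerical integration over the density
mean_num <- integrate(function(x) x * dlnorm(x, mu_ln, sigma_ln), 0, Inf)$value
second_moment <- integrate(function(x) x^2 * dlnorm(x, mu_ln, sigma_ln), 0, Inf)$value
var_num <- second_moment - mean_num^2

c(mean_formula = mean_formula, mean_num = mean_num)
c(var_formula = var_formula, var_num = var_num)

# log(mean) grows with sigma_ln even though mu_ln is held fixed
sigmas <- c(0.1, 0.5, 1.0)
log_of_mean <- log(exp(mu_ln + sigmas^2 / 2))
print(log_of_mean)
```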

This means that women could, on average, have higher blood pressure than men, even though the mean parameter of the log-normal distribution ($\mu_{\ln}$) is the same, simply because the variance parameter is larger. This fact would be missed by a test that used log(Blood Pressure).

So far, we have assumed that blood pressure genuinely is log-normal. If the true distributions are not quite log-normal, then transforming the data will (typically) make things even worse than above, since we won't quite know what our "mean" parameter actually means. That is, we won't know whether the two equations for the mean and variance given above are correct. Using them to transform back and forth will then introduce additional errors.
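A small simulation can illustrate the point. Everything here is an assumption for illustration: the seed, the sample size, and the parameter values (the same $\mu_{\ln}$ in both groups, a larger $\sigma_{\ln}$ for women). It compares a linear model on the logged response with a Gaussian GLM using a log link.

```r
set.seed(1)   # arbitrary seed
n <- 1e5      # large sample so the estimates sit close to their targets

# Same mean of the log in both groups, different spread on the log scale
men <- rlnorm(n, meanlog = log(120), sdlog = 0.1)
women <- rlnorm(n, meanlog = log(120), sdlog = 0.3)
bp <- c(men, women)
sex <- factor(rep(c("men", "women"), each = n))

# Linear model on log(bp): compares the means of the logs
fit_log_y <- lm(log(bp) ~ sex)

# Gaussian GLM with log link: compares the logs of the means
fit_log_link <- glm(bp ~ sex, family = gaussian(link = "log"))

# Difference in mu_ln between groups: close to zero
diff_mean_of_log <- unname(coef(fit_log_y)["sexwomen"])

# Difference in log(mean): close to (0.3^2 - 0.1^2) / 2 = 0.04
diff_log_of_mean <- unname(coef(fit_log_link)["sexwomen"])

print(c(mean_of_log = diff_mean_of_log, log_of_mean = diff_log_of_mean))
```

The model on the logged response sees essentially no difference between the groups, while the log-link model picks up the difference in average blood pressure that comes from the larger variance.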

Solved – R-squared in linear model versus deviance in generalized linear model

From what I can tell, we cannot run an ordinary least squares regression in R when using weighted data and the survey package. Here, we have to use svyglm(), which instead runs a generalized linear model (which may be the same thing? I am fuzzy here in terms of what is different).

svyglm will give you a linear model if you use family = gaussian() which seems to be the default from the survey vignette (in version 3.32-1). See the example where they find the regmodel.

It seems that the package just makes sure that you use the correct weights when it calls glm. Thus, if your outcome is continuous and you assume that it is iid normally distributed, then you should use family = gaussian(). The result is a weighted linear model. This answers the question

Why can we not run OLS in the survey package, while it seems that this is possible to do with weighted data in Stata?

by stating that you can indeed do that with the survey package. As for the following question:

What is the difference in interpretation between the deviance of a generalized linear model and an r-squared value?

There is a straightforward formula to get the $R^2$ with family = gaussian(), as some people have mentioned in the comments. Adding weights does not change anything either, as I show below:

> set.seed(42293888)
> x <- (-4):5
> y <- 2 + x + rnorm(length(x))
> org <- data.frame(x = x, y = y, weights = 1:10)
> 
> # show data and fit model. Notice the R-squared
> head(org) 
   x          y weights
1 -4  0.4963671       1
2 -3 -0.5675720       2
3 -2 -0.3615302       3
4 -1  0.7091697       4
5  0  0.6485203       5
6  1  3.8495979       6
> summary(lm(y ~ x, org, weights = weights))

Call:
lm(formula = y ~ x, data = org, weights = weights)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-3.1693 -0.4463  0.2017  0.9100  2.9667 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.7368     0.3514   4.942  0.00113 ** 
x             0.9016     0.1111   8.113 3.95e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.019 on 8 degrees of freedom
Multiple R-squared:  0.8916,    Adjusted R-squared:  0.8781 
F-statistic: 65.83 on 1 and 8 DF,  p-value: 3.946e-05

> 
> # make redundant data set with redundant rows
> idx <- unlist(mapply(rep, x = 1:nrow(org), times = org$weights))
> org_redundant <- org[idx, ]
> head(org_redundant)
     x          y weights
1   -4  0.4963671       1
2   -3 -0.5675720       2
2.1 -3 -0.5675720       2
3   -2 -0.3615302       3
3.1 -2 -0.3615302       3
3.2 -2 -0.3615302       3
> 
> # fit model and notice the same R-squared
> summary(lm(y ~ x, org_redundant))

Call:
lm(formula = y ~ x, data = org_redundant)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.19789 -0.29506 -0.05435  0.33131  2.36610 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.73680    0.13653   12.72   <2e-16 ***
x            0.90163    0.04318   20.88   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7843 on 53 degrees of freedom
Multiple R-squared:  0.8916,    Adjusted R-squared:  0.8896 
F-statistic: 436.1 on 1 and 53 DF,  p-value: < 2.2e-16

> 
> # glm gives you the same with family = gaussian()  
> # just compute the R^2 from the deviances. See 
> #   https://stats.stackexchange.com/a/46358/81865
> fit <- glm(y ~ x, family = gaussian(), org_redundant)
> fit$coefficients
(Intercept)           x 
  1.7368017   0.9016347 
> 1 - fit$deviance / fit$null.deviance
[1] 0.8916387

The deviance is just the sum of squared errors (the residual sum of squares) when you use family = gaussian().
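A minimal check of that statement, using R's built-in cars data set (an assumption chosen only for illustration, not the data from the transcript above):

```r
# Same model fitted two ways on the built-in 'cars' data
fit_glm <- glm(dist ~ speed, family = gaussian(), data = cars)
fit_lm <- lm(dist ~ speed, data = cars)

# Residual and total sums of squares from the linear model
sse <- sum(residuals(fit_lm)^2)
sst <- sum((cars$dist - mean(cars$dist))^2)

# Gaussian deviance equals SSE; null deviance equals SST
stopifnot(isTRUE(all.equal(deviance(fit_glm), sse)))
stopifnot(isTRUE(all.equal(fit_glm$null.deviance, sst)))

# So 1 - deviance / null deviance reproduces the usual R-squared
r2 <- 1 - fit_glm$deviance / fit_glm$null.deviance
stopifnot(isTRUE(all.equal(r2, summary(fit_lm)$r.squared)))
print(r2)
```

For non-Gaussian families the same ratio is sometimes reported as a deviance-based pseudo R-squared, but it no longer has the variance-explained interpretation.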

Caveats

I assume that you want a linear model from your question. Further, I have never used the survey package but quickly scanned through it and made assumptions about what it does which I state in my answer.
