Solved – What kind of residuals and Cook’s distance are used for GLM

cooks-distancegeneralized linear modelrregressionresiduals

Does anybody know what the formula for Cook's distance is? The original Cook's distance formula uses studentized residuals, but why is R using std. Pearson residuals when computing the Cook's distance plot for a GLM. I know that studentized residuals are not defined for GLMs, but how does the formula to compute Cook's distance look like?

Assume the following example:

numberofdrugs <- rcauchy(84, 10)
healthvalue <- rpois(84,75)
test <- glm(healthvalue ~ numberofdrugs, family=poisson)
plot(test, which=5)

What is the formula for Cook's distance? In other words, what is the formula to compute the red dashed line? And where does this formula for standardized Pearson residuals come from?

cook's distance

Best Answer

If you take a look at the code (simple type plot.lm, without parenthesis, or edit(plot.lm) at the R prompt), you'll see that Cook's distances are defined line 44, with the cooks.distance() function. To see what it does, type stats:::cooks.distance.glm at the R prompt. There you see that it is defined as

(res/(1 - hat))^2 * hat/(dispersion * p)

where res are Pearson residuals (as returned by the influence() function), hat is the hat matrix, p is the number of parameters in the model, and dispersion is the dispersion considered for the current model (fixed at one for logistic and Poisson regression, see help(glm)). In sum, it is computed as a function of the leverage of the observations and their standardized residuals. (Compare with stats:::cooks.distance.lm.)

For a more formal reference you can follow references in the plot.lm() function, namely

Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics. New York: Wiley.

Moreover, about the additional information displayed in the graphics, we can look further and see that R uses

plot(xx, rsp, ...                    # line 230
panel(xx, rsp, ...)                  # line 233
cl.h <- sqrt(crit * p * (1 - hh)/hh) # line 243
lines(hh, cl.h, lty = 2, col = 2)    #
lines(hh, -cl.h, lty = 2, col = 2)   #

where rsp is labeled as Std. Pearson resid. in case of a GLM, Std. residuals otherwise (line 172); in both cases, however, the formula used by R is (lines 175 and 178)

residuals(x, "pearson") / s * sqrt(1 - hii)

where hii is the hat matrix returned by the generic function lm.influence(). This is the usual formula for std. residuals:

$$rs_j=\frac{r_j}{\sqrt{1-\hat h_j}}$$

where $j$ here denotes the $j$th covariate of interest. See e.g., Agresti Categorical Data Analysis, §4.5.5.

The next lines of R code draw a smoother for Cook's distance (add.smooth=TRUE in plot.lm() by default, see getOption("add.smooth")) and contour lines (not visible in your plot) for critical standardized residuals (see the cook.levels= option).

Related Solutions

Linear Mixed Effects Models – How to Extract Leverage and Cook’s Distances

You should have a look at the R package influence.ME. It allows you to compute measures of influential data for mixed effects models generated by lme4.

An example model:

library(lme4)
model <- lmer(mpg ~ disp + (1 | cyl), mtcars)

The function influence is the basis for all further steps:

library(influence.ME)
infl <- influence(model, obs = TRUE)

Calculate Cook's distance:

cooks.distance(infl)

Plot Cook's distance:

plot(infl, which = "cook")

enter image description here

Solved – R: test normality of residuals of linear model – which residuals to use

Grew too long for a comment.

For an ordinary regression model (such as would be fitted by lm), there's no distinction between the first two residual types you consider; type="pearson" is relevant for non-Gaussian GLMs, but is the same as response for gaussian models.
The observations you apply your tests to (some form of residuals) aren't independent, so the usual statistics don't have the correct distribution. Further, strictly speaking, none of the residuals you consider will be exactly normal, since your data will never be exactly normal. [Formal testing answers the wrong question - a more relevant question would be 'how much will this non-normality impact my inference?', a question not answered by the usual goodness of fit hypothesis testing.]
Even if your data were to be exactly normal, neither the third nor the fourth kind of residual would be exactly normal. Nevertheless it's much more common for people to examine those (say by QQ plots) than the raw residuals.
You could overcome some of the issues in 2. and 3. (dependence in residuals as well as non-normality in standardized residuals) by simulation conditional on your design matrix ($\mathbf{X}$), meaning you could use whichever residuals you like (however you can't deal with the "answering an unhelpful question you already know the answer to" problem that way).

Best Answer

Related Solutions

Linear Mixed Effects Models – How to Extract Leverage and Cook’s Distances

Solved – R: test normality of residuals of linear model – which residuals to use

Related Question