Solved – Rules of thumb for partial residual (component + residual) plots as diagnostics for linearity

data visualization, diagnostic, loess, multiple regression, residuals

Here are the standard R diagnostic plots of a multiple linear regression model that includes an autoregressive term at lag 1 (i.e., AR(1)). I have log-transformed and z-scored my input data.

Ben Bolker says here that a scale-location plot is good for detecting heteroskedasticity, and a residuals vs. fitted plot is better for assessing linearity. So my interpretation of these results is that the multiple regression is pretty linear (residuals vs. fitted plot), the residuals are normal (Q-Q plot) and essentially homoskedastic (scale-location plot), and the outliers aren't too bad (residuals vs. leverage plot). So far, so good. But when I do a partial residual (component + residual) plot, the panels for the individual variables suggest that none of the component variables is linear:

The dotted red lines show the least squares fit, and the green loess smoother lines, as I understand it, indicate the real shape of the data. John Fox's book Applied Regression Analysis and Generalized Linear Models, 3rd ed. in Chapter 12 shows some component + residual plots that he says should be data-transformed for not being linear, but his examples don't show the zig-zag pattern I'm seeing in these plots. So these seem worse than the ones he shows, but on the other hand, maybe the diacy.tmin plot is close enough to linear, even though it wiggles around the least squares fit.
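For readers unfamiliar with how such a plot is built, here is a minimal sketch in base R. It uses the built-in mtcars data as a stand-in (the model and variables are illustrative assumptions, not the data from the question). The partial residual for a predictor is the ordinary residual plus that predictor's fitted (centered) component:

```r
# Illustration on a built-in dataset (mtcars); not the data from the question.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Partial residual for wt: ordinary residual + b_wt * (wt - mean(wt))
pr_manual <- residuals(fit) + coef(fit)["wt"] * (mtcars$wt - mean(mtcars$wt))

# The same quantity from the built-in extractor
pr_builtin <- residuals(fit, type = "partial")[, "wt"]

plot(mtcars$wt, pr_manual, xlab = "wt", ylab = "Component + residual (wt)")
abline(lm(pr_manual ~ mtcars$wt), lty = 2, col = "red")  # least squares line
lines(lowess(mtcars$wt, pr_manual), col = "green")       # smoother
```

The slope of the dashed line equals the coefficient of wt in the full model, so the smoother is being judged against the fitted partial relationship.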

My question is: how bad do the components + residuals plots have to be before it's necessary/advisable to transform the data to improve linearity? Are these plots too problematic to leave in the model as-is? And because the first set of diagnostic plots are well-behaved, and presumably show linearity, does that mean I don't have to take the components + residuals plots as seriously?

Best Answer

I agree with @user2974951. You have to think about how a LOWESS line is fit. By design, it is very flexible, so it wiggles. It is extremely unlikely that it would actually be a perfectly straight line that falls on the dashed regression line. In fact, in most cases where it did, I would suspect overfitting rather than evidence of an appropriate fit. If it pretty much has to wiggle, then the question is whether it seems to wiggle randomly around your fitted regression line, or whether it veers off substantially (and, you'd guess, reliably). In your case, it doesn't seem like the latter to me.
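To see this for yourself, here is a small simulation sketch (the seed, sample size, noise level, and span values are arbitrary choices for illustration). The data are generated from a truly linear model, yet the LOWESS curves still wander around the least squares line:

```r
set.seed(1)                              # arbitrary seed, for reproducibility
x <- sort(runif(100))
y <- 1 + 2 * x + rnorm(100, sd = 0.5)    # the true relationship is linear

fit <- lm(y ~ x)
sm_default <- lowess(x, y)               # default span f = 2/3
sm_narrow  <- lowess(x, y, f = 0.2)      # smaller span, more flexible curve

plot(x, y)
abline(fit, lty = 2, col = "red")        # least squares fit
lines(sm_default, col = "green")
lines(sm_narrow, col = "blue")

# Largest gap between each smoother and the straight-line fit
max(abs(sm_default$y - fitted(fit)))
max(abs(sm_narrow$y - fitted(fit)))
```

Rerunning with different seeds gives a feel for how much wiggle is just noise.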

However, I think the component + residual plots you are using are harder to read, especially when you aren't very experienced yet. It has been known, going back at least to the 1970s with Tukey and Cleveland, that it's harder to determine whether data follow a line when the line is sloped. It is much easier when the line is horizontal. As a result, I would recommend you use plots of residuals vs. X instead. That is, you would make one plot for each X variable (in your case, presumably 5 plots), with the residuals on the vertical axis and the X variable on the horizontal axis. From there, you could plot a faint horizontal line at 0, and overlay a LOWESS line, if you'd like. (Bear in mind that you would have the same issues with the wiggliness of the LOWESS fit in that case.) Then you would look for systematic deviations from the horizontal line in your data.
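A minimal sketch of those residuals vs. X plots in base R, again using mtcars as a stand-in for your data (the model and predictor names are assumptions for illustration only):

```r
# Stand-in model on a built-in dataset; swap in your own fit and predictors.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
res <- residuals(fit)
predictors <- c("wt", "hp", "disp")

op <- par(mfrow = c(1, length(predictors)))        # one panel per predictor
for (v in predictors) {
  plot(mtcars[[v]], res, xlab = v, ylab = "Residuals")
  abline(h = 0, col = "grey")                      # faint horizontal reference
  lines(lowess(mtcars[[v]], res), col = "green")   # LOWESS overlay
}
par(op)                                            # restore plotting settings
```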

If you have both the standard plots at the top (i.e., including the scale-location plot) and the individual residuals vs. X plots, I would just ignore the residuals vs. fitted plot. It has become a dominated strategy. You are better able to detect heteroscedasticity in the scale-location plot, and non-linearity (more accurately, incorrect functional form) in the residuals vs. X plots.

Related Solutions

Solved – Possible extensions to the default diagnostic plots for lm (in R and in general)

Package car has quite a lot of useful functions for diagnostic plots of linear and generalized linear models. Compared to vanilla R plots, they are often enhanced with additional information. I recommend you try example("<function>") on the following functions to see what the plots look like. All plots are described in detail in chapter 6 of Fox & Weisberg. 2011. An R Companion to Applied Regression. 2nd ed.

residualPlots() plots Pearson residuals against each predictor (scatterplots for numeric variables including a Lowess fit, boxplots for factors)
marginalModelPlots() displays scatterplots of the response variable against each numeric predictor, including a Lowess fit
avPlots() displays partial-regression plots: for each predictor, this is a scatterplot of a) the residuals from the regression of the response variable on all other predictors against b) the residuals from the regression of the predictor against all other predictors
qqPlot() for a quantile-quantile plot which includes a confidence envelope
influenceIndexPlot() displays each value for Cook's distance, hat-value, p-value for outlier test, and studentized residual in a spike-plot against the observation index
influencePlot() gives a bubble-plot of studentized residuals against hat-values, with the size of the bubble corresponding to Cook's distance, also see dfbetaPlots() and leveragePlots()
boxCox() displays a profile of the log-likelihood for the transformation parameter $\lambda$ in a Box-Cox power-transform
crPlots() is for component + residual plots, a variant of which are CERES plots (Combining conditional Expectations and RESiduals), provided by ceresPlots()
spreadLevelPlot() is for assessing non-constant error variance and displays absolute studentized residuals against fitted values
scatterplot() provides much-enhanced scatterplots including boxplots along the axes, confidence ellipses for the bivariate distribution, and prediction lines with confidence bands
scatter3d() is based on package rgl and displays interactive 3D-scatterplots including wire-mesh confidence ellipsoids and prediction planes; make sure to run example("scatter3d")
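
As a quick way to try a few of these, here is a hedged sketch that assumes the car package is installed and uses a stand-in model on the built-in mtcars data (the model is an illustration, not taken from the question):

```r
# Assumes the car package is installed; mtcars is a stand-in dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

if (requireNamespace("car", quietly = TRUE)) {
  car::residualPlots(fit)  # residuals vs. each predictor and vs. fitted values
  car::crPlots(fit)        # component + residual plots
  car::avPlots(fit)        # added-variable (partial-regression) plots
  car::qqPlot(fit)         # Q-Q plot of studentized residuals with envelope
}
```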

In addition, have a look at bplot() from package rms for another approach to illustrating the joint distribution of three variables.

Solved – Diagnostic plots for count regression

Here is what I usually like doing (for illustration I use the overdispersed and not very easily modelled quine data of pupils' days absent from school from MASS):

Test and graph the original count data by plotting observed frequencies and fitted frequencies (see chapter 2 in Friendly), which is largely supported by the vcd package in R. For example, with goodfit and a rootogram:
```
library(MASS)
library(vcd)
data(quine) 
fit <- goodfit(quine$Days) 
summary(fit) 
rootogram(fit)
```
or with Ord plots which help in identifying which count data model is underlying (e.g., here the slope is positive and the intercept is positive which speaks for a negative binomial distribution):
```
Ord_plot(quine$Days)
```
or with the "XXXXXness" plots, where XXXXX is the distribution of choice, say a Poissonness plot (which here speaks against Poisson; try also type="nbinom"):
```
distplot(quine$Days, type="poisson")
```
Inspect usual goodness-of-fit measures (such as likelihood ratio statistics vs. a null model or similar):
```
mod1 <- glm(Days~Age+Sex, data=quine, family="poisson")
summary(mod1)
anova(mod1, test="Chisq")
```
Check for over- / underdispersion by looking at residual deviance/df or at a formal test statistic (e.g., see this answer). Here we clearly have overdispersion:
```
library(AER)
deviance(mod1)/mod1$df.residual
dispersiontest(mod1)
```
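As a cross-check that does not need AER, a Pearson-based dispersion estimate can be computed by hand. This is a sketch that refits the same model so that it runs on its own; the object name pearson_disp is mine:

```r
library(MASS)  # for the quine data

mod1 <- glm(Days ~ Age + Sex, data = quine, family = poisson)

# Pearson chi-square statistic divided by the residual degrees of freedom;
# values well above 1 point to overdispersion
pearson_disp <- sum(residuals(mod1, type = "pearson")^2) / df.residual(mod1)
pearson_disp

# A quasi-Poisson fit reports the same estimate as its dispersion parameter
summary(update(mod1, family = quasipoisson))$dispersion
```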
Check for influential and leverage points, e.g., with the influencePlot in the car package. Of course here many points are highly influential because Poisson is a bad model:
```
library(car)
influencePlot(mod1)
```
Check for zero inflation by fitting a count data model and its zero-inflated / hurdle counterpart and comparing them (usually with AIC). Here a zero-inflated model would fit better than the simple Poisson (again probably due to overdispersion):
```
library(pscl)
mod2 <- zeroinfl(Days~Age+Sex, data=quine, dist="poisson")
AIC(mod1, mod2)
```
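Since the improvement may be driven by overdispersion rather than by excess zeros, one way to probe that is to add a negative binomial fit to the comparison. This is a sketch using glm.nb from MASS; the object name mod_nb is mine, and this check is my addition, not part of the original steps:

```r
library(MASS)  # quine data and glm.nb()

mod1   <- glm(Days ~ Age + Sex, data = quine, family = poisson)
mod_nb <- glm.nb(Days ~ Age + Sex, data = quine)  # negative binomial counterpart

# A much lower AIC for the negative binomial suggests overdispersion is the main issue
AIC(mod1, mod_nb)
```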
Plot the residuals (raw, deviance or scaled) on the y-axis vs. the (log) predicted values (or the linear predictor) on the x-axis. Here we see some very large residuals and a substantial deviation of the deviance residuals from normality (speaking against the Poisson; Edit: @FlorianHartig's answer suggests that normality of these residuals is not to be expected, so this is not a conclusive clue):
```
res <- residuals(mod1, type="deviance")
plot(predict(mod1, type="link"), res)  # the linear predictor is already the log of the predicted mean
abline(h=0, lty=2)
qqnorm(res)
qqline(res)
```
If interested, plot a half-normal probability plot of residuals by plotting ordered absolute residuals vs. expected normal values (Atkinson, 1981). A special feature would be to simulate a reference ‘line’ and envelope with simulated / bootstrapped confidence intervals (not shown though):
```
library(faraway)
halfnorm(residuals(mod1))
```
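For the simulated envelope mentioned above, here is a rough sketch of one way to do it. The number of simulations, the seed, the use of deviance residuals, and the 95% pointwise limits are my own choices, and the plotting positions follow a commonly used half-normal approximation:

```r
library(MASS)  # for the quine data

mod1 <- glm(Days ~ Age + Sex, data = quine, family = poisson)

set.seed(1)    # arbitrary seed
n    <- nrow(quine)
nsim <- 99     # arbitrary number of simulated data sets
obs  <- sort(abs(residuals(mod1, type = "deviance")))

# Simulate responses from the fitted model, refit, keep sorted |residuals|
sims <- replicate(nsim, {
  d <- quine
  d$Days <- simulate(mod1)[[1]]
  refit <- glm(Days ~ Age + Sex, data = d, family = poisson)
  sort(abs(residuals(refit, type = "deviance")))
})

# Pointwise 2.5%, 50% and 97.5% limits for each order statistic
env <- apply(sims, 1, quantile, probs = c(0.025, 0.5, 0.975))
q   <- qnorm((seq_len(n) + n - 1/8) / (2 * n + 1/2))  # half-normal scores

plot(q, obs, xlab = "Half-normal quantiles", ylab = "|deviance residuals|")
lines(q, env[1, ], lty = 2)
lines(q, env[2, ])
lines(q, env[3, ], lty = 2)
```

Points falling well outside the dashed band indicate that the fitted model does not reproduce the observed residual pattern.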
Diagnostic plots for log-linear models for count data (see chapters 7.2 and 7.7 in Friendly's book). Plot predicted vs. observed values, perhaps with some interval estimate. I did this just for the age groups; here we see again that, because of the overdispersion, our estimates are pretty far off, except perhaps in group F3. The pink points are the point prediction $\pm$ one standard error:
```
plot(Days~Age, data=quine) 
prs  <- predict(mod1, type="response", se.fit=TRUE)
pris <- data.frame("pest"=prs[[1]], "lwr"=prs[[1]]-prs[[2]], "upr"=prs[[1]]+prs[[2]])
points(pris$pest ~ quine$Age, col="red")
points(pris$lwr  ~ quine$Age, col="pink", pch=19)
points(pris$upr  ~ quine$Age, col="pink", pch=19)
```

This should give you much of the useful information about your analysis and most steps work for all standard count data distributions (e.g., Poisson, Negative Binomial, COM Poisson, Power Laws).
