I'm sorry if this is a broad question but can someone explain to me how to interpret these regression diagnostic plots? I understand the Normal Q-Q shows how normal the spread of the data is, but the other three I am confused by. For the Residuals vs Leverage, what does it mean for a point to be outside of the bounds? What does it mean if the red lines are parabolic and curved like in the first two? Why do we care about residuals, and what do they tell us? I've searched the web multiple times for some guide to interpreting these and I can't get any straightforward answers. Any help is appreciated, thank you.
Solved – Interpreting Regression Diagnostic Plots
Related Solutions
Package `car` has quite a lot of useful functions for diagnostic plots of linear and generalized linear models. Compared to vanilla R plots, they are often enhanced with additional information. I recommend you try `example("<function>")` on the following functions to see what the plots look like. All plots are described in detail in chapter 6 of Fox & Weisberg. 2011. An R Companion to Applied Regression. 2nd ed.

- `residualPlots()` plots Pearson residuals against each predictor (scatterplots for numeric variables including a Lowess fit, boxplots for factors)
- `marginalModelPlots()` displays scatterplots of the response variable against each numeric predictor, including a Lowess fit
- `avPlots()` displays partial-regression plots: for each predictor, this is a scatterplot of a) the residuals from the regression of the response variable on all other predictors against b) the residuals from the regression of the predictor against all other predictors
- `qqPlot()` gives a quantile-quantile plot which includes a confidence envelope
- `influenceIndexPlot()` displays each value for Cook's distance, hat-value, p-value for the outlier test, and studentized residual in a spike-plot against the observation index
- `influencePlot()` gives a bubble-plot of studentized residuals against hat-values, with the size of the bubble corresponding to Cook's distance; also see `dfbetaPlots()` and `leveragePlots()`
- `boxCox()` displays a profile of the log-likelihood for the transformation parameter $\lambda$ in a Box-Cox power transform
- `crPlots()` is for component + residual plots; a variant of these are CERES plots (Combining conditional Expectations and RESiduals), provided by `ceresPlots()`
- `spreadLevelPlot()` is for assessing non-constant error variance and displays absolute studentized residuals against fitted values
- `scatterplot()` provides much-enhanced scatterplots including boxplots along the axes, confidence ellipses for the bivariate distribution, and prediction lines with confidence bands
- `scatter3d()` is based on package `rgl` and displays interactive 3D scatterplots including wire-mesh confidence ellipsoids and prediction planes; make sure to run `example("scatter3d")`

In addition, have a look at `bplot()` from package `rms` for another approach to illustrating the joint distribution of three variables.
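To make the workflow concrete, here is a minimal sketch of how a few of these diagnostics might be called, assuming the `car` package is installed; the model and data set (`mtcars`) are just placeholders for illustration:

```r
library(car)

# A toy linear model on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

residualPlots(fit)       # Pearson residuals against each predictor
avPlots(fit)             # partial-regression (added-variable) plots
qqPlot(fit)              # QQ-plot with a confidence envelope
influenceIndexPlot(fit)  # Cook's distance, hat-values, etc., by observation index
```

Each function takes the fitted model object directly, so swapping in your own `lm()` or `glm()` fit should work the same way.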
I think this is one of the most challenging parts when doing regression analysis. I also struggle with most of the interpretations (in particular binomial diagnostics are crazy!).
I just stumbled on this post http://www.r-bloggers.com/model-validation-interpreting-residual-plots/ which also linked https://web.archive.org/web/20100202230711/http://statmaster.sdu.dk/courses/st111/module04/module.pdf
What helps me the most is to plot the residuals against every predictor included AND not included in the model. That means also those that were dropped beforehand for multicollinearity reasons. For this, boxplots, conditional scatterplots, and ordinary scatterplots are great. This helps to spot possible errors.
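That advice can be sketched in a few lines of R. All names here are hypothetical: assume a data frame `df` with response `y`, fitted predictors `x1` and `x2`, a numeric predictor `x3` that was left out of the model, and a factor `group` that was also left out:

```r
# Hypothetical model: x3 and group were deliberately excluded
fit <- lm(y ~ x1 + x2, data = df)
res <- resid(fit)

plot(df$x1, res); abline(h = 0, lty = 2)  # included predictor: want no pattern
plot(df$x3, res); abline(h = 0, lty = 2)  # excluded predictor: a pattern here
                                          # suggests x3 carries missed signal
boxplot(res ~ df$group)                   # excluded factor: boxplots by level
```

If the residuals show structure against an excluded variable, that variable (or a proxy for it) may belong in the model.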
In "Forest Analytics with R" (UseR Series) there are some good explanations of how to interpret residuals for mixed-effects models (and GLMs as well). Good read! https://www.springer.com/gp/book/9781441977618
Some time ago I thought about a website that could collect residual patterns which users could vote to be "ok" or "not ok", but I never found such a website ;)
Best Answer
As mentioned above, there are a fair few answers on assessing these kinds of plots, but it can't hurt to have all the answers in one place. I have created some data and code in R to illustrate my answer:
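The original code block did not survive the formatting here, so the following is only a sketch of the kind of simulated data and diagnostic call being described (all names hypothetical):

```r
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 1.5 * x + rnorm(100, sd = 2)  # linear trend with constant-variance noise
df <- data.frame(x, y)

fit <- lm(y ~ x, data = df)
par(mfrow = c(2, 2))
plot(fit)  # the four standard diagnostic plots discussed below
```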
To begin with it is always good to plot your variables against one another. If you have just one predictor then something like:
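The code block that followed here was lost in formatting; a minimal sketch of such a call, assuming a data frame `df` with predictor `x` and response `y`, would be:

```r
plot(y ~ x, data = df)  # simple scatterplot of the response against the predictor
```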
works well. If you have multiple predictors then you can use `pairs(df)`. In this case the data looks like it has a positive linear association, suggesting that a straight-line model is appropriate.

The residual vs. fitted plot is usually used to determine whether the residuals meet the equality-of-variance assumption that is implicit in linear models (which is what I assume you have fitted). There are a few things we look for here. If you have a continuous predictor then we want to see that the residuals are equally and randomly distributed around 0, with essentially no pattern - i.e. white noise. If it looks like white noise, it means that your linear model is capturing the pattern in your data. If you have significant curvature (which your plot shows), it means that you have not captured some pattern; this commonly happens when you fit a straight line to data which has a curve in it, suggesting that you might need to add a quadratic term to your model - this allows your fitted line to curve. Sometimes this can also be fixed by log-transforming your response variable. The importance of this is that your error term will not be accurate otherwise: it will underestimate the errors at low and high values of your predictor and overestimate them in the mid-range, which in turn means your predictions and coefficients will not be accurate.
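The two fixes for curvature mentioned above might look like this in R (a sketch, with `df`, `x`, and `y` as hypothetical names):

```r
fit <- lm(y ~ x, data = df)            # original straight-line fit
plot(fit, which = 1)                   # residuals vs fitted: look for curvature

fit2 <- lm(y ~ x + I(x^2), data = df)  # option 1: add a quadratic term
fit3 <- lm(log(y) ~ x, data = df)      # option 2: log-transform the response
                                       #           (requires y > 0)
```

Refitting and re-checking the residuals vs. fitted plot tells you whether either option removed the pattern.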
The second thing we look for in this plot (when assessing linear models - i.e. your response is normally distributed) is that the residuals have roughly equal variance across the range of fitted values - i.e. the overall spread of the residuals is roughly equal across the range of fitted values. Put another way, you don't have something like a trumpet or rugby-ball shape, although there are other inappropriate shapes. Unequal spread again means that the errors across your fitted line are not equal - some are underestimated while others are overestimated - which will throw off predictions and alter estimates/p-values. The plot should look something like this:
In my case the plot has some funny edges which are a result of how I created the data but it should serve to give an indication of what we are after.
The second plot (Normal Q-Q) checks that the errors are roughly normally distributed, meaning that most of the residuals lie close to the line and only a few fall far away. This has important implications for predictions again, but if you are only interested in p-values then it is less important, as we can invoke the central limit theorem in most cases - some more reading on this here. Essentially we want the points to lie on the fitted line. If they don't, it could be for any number of reasons. In your case it is likely due to the curvature that your model has missed, but it could also be fixed by a log-transformation of your response. The plot should look something like this:
Another way you can see this is using the `normcheck` function in the `s20x` library, which gives a bit more intuitive output.

The third plot (Scale-Location plot) shows much the same as the residual vs. fitted plot but on a standardised scale. The residual vs. fitted and scale-location plots can be used to assess heteroscedasticity (variance changing with fitted values) as well. The plot should look something like this:
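For reference, the `s20x` call mentioned above is just (assuming the package is installed and `fit` is your fitted `lm` object):

```r
library(s20x)
normcheck(fit)  # graphical normality check of the model's residuals
```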
This is also a better example of the kind of pattern we want to see in the first plot as it has lost the odd edges.
The final plot (residuals vs. leverage) is a way of checking whether any points are having undue influence over the line. The ordinary least squares method, which is what linear models employ, tries to minimise the sum of squared vertical distances between the line and all the points. Points which are further from the rest can sometimes have a greater influence over the fit - it's kind of like sitting on the far end of a seesaw compared to sitting near the fulcrum. Typically the rule of thumb is that if a point has a Cook's distance > 0.4 it has a large influence over the line, and sometimes this might be grounds to remove it, but this should never be done lightly. Your plot shows that at least two points (22 and 50, which correspond to rows 22 and 50 of your data), labelled with their indexes, are having a large influence over your line. The indexes allow one to easily subset the points to have a look at them and see if they are wonky - sometimes it just comes down to an entry error. In your case I suspect that it is because the model is inappropriate rather than the data having anomalies. The plot should look something like this:
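Those flagged rows can be pulled out for inspection using base R's `cooks.distance()` (a sketch, assuming `fit` is your model and `df` its data frame):

```r
cd <- cooks.distance(fit)   # one Cook's distance per observation
flagged <- which(cd > 0.4)  # 0.4 is the rule of thumb mentioned above
flagged                     # indexes of high-influence points
df[flagged, ]               # look at the raw rows for entry errors
```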
You can see in that last one that there are no red boundary lines like in the top right corner of yours - this is because none of my points come close to having high leverage. This can also be assessed using the `s20x` package in a more intuitive (I feel) way using the `cooks20x` function.

Just look for points that are above 0.4 on the y-axis. The number displayed above each column is the row of the observation in the data frame.
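That call is simply (again assuming `s20x` is installed and `fit` is your `lm` object):

```r
library(s20x)
cooks20x(fit)  # Cook's distance per observation, labelled by row number
```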
I hope this helps!