Solved – Interpreting Regression Diagnostic Plots

Tags: diagnostic, qq-plot, regression, residuals

I'm sorry if this is a broad question, but can someone explain to me how to interpret these regression diagnostic plots? I understand the Normal Q-Q plot shows how normal the spread of the data is, but the other three I am confused by. For the Residuals vs Leverage plot, what does it mean for a point to be outside of the bounds? What does it mean if the red lines are parabolic and curved like in the first two? Why do we care about residuals, and what do they tell us? I've searched the web multiple times for a guide to interpreting these and I can't get any straightforward answers. Any help is appreciated, thank you.

[image: the asker's four regression diagnostic plots]

Best Answer

As mentioned above, there are a fair few answers on assessing these kinds of plots, but it can't hurt to have all the answers in one place. I have created some data and code in R to illustrate my answer:

#Data creation (seeded so the example is reproducible)
set.seed(1)
df <- data.frame(y = rep(1:100, 10))  # response: 1 to 100, repeated 10 times
df$x <- df$y + rnorm(1000, sd = 5)    # predictor: the response plus noise

To begin with, it is always good to plot your variables against one another. If you have just one predictor then something like:

 plot(y ~ x, data = df)

works well. If you have multiple predictors then you can use pairs(df). In this case the data look like they have a positive linear association, suggesting that an appropriate model is:

fit <- lm(y ~ x, data = df)

The residuals vs. fitted plot is usually used to check the equal-variance assumption that is implicit in linear models, which is what I assume you have fitted. There are a few things we look for here. If you have a continuous predictor, we want the residuals to be evenly and randomly scattered around 0 with essentially no pattern, i.e. white noise. If it looks like white noise, your linear model is capturing all the systematic pattern in your data. If there is significant curvature (which your plot shows), the model has missed some pattern. This commonly happens when you fit a straight line to data that curve, and it suggests you might need to add a quadratic term to your model, which allows the fitted line to bend. Sometimes this can also be fixed by log-transforming your response variable. The importance of this is that your error term will not be accurate otherwise: it will underestimate the errors at low and high values of your predictor and overestimate them in the mid-range, which in turn means your predictions and coefficients will not be accurate.
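If you do see curvature, the fix could look something like this (just a sketch; fit_quad and fit_log are illustrative names, and the log version only makes sense if your response is strictly positive):

#Possible fixes for curvature in the residuals vs. fitted plot
fit_quad <- lm(y ~ x + I(x^2), data = df)  # let the fitted line bend
fit_log <- lm(log(y) ~ x, data = df)       # or log-transform the response
plot(fit_quad, which = 1)                  # then re-check the residuals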

The second thing we look for in this plot (when assessing linear models, i.e. your response is normally distributed) is that the residuals have roughly equal variance across the range of fitted values, i.e. the overall spread of the residuals is roughly the same from left to right. Put another way, you don't want something like a trumpet or rugby-ball shape, although there are other inappropriate shapes. Unequal spread means the errors along your fitted line are not equal (some underestimated, some overestimated), which will throw off predictions and distort estimates and p-values. The plot should look something like this:

plot(fit, which = 1)

In my case the plot has some funny edges, which are a result of how I created the data, but it should give an indication of what we are after.
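If you want to see what a violation looks like, you can simulate data whose noise grows with the predictor; the residuals vs. fitted plot then shows the trumpet shape described above (again just a sketch, with df2 and fit2 as illustrative names):

#Simulated heteroscedastic data: the noise sd grows with x
set.seed(1)
df2 <- data.frame(x = 1:100)
df2$y <- df2$x + rnorm(100, sd = df2$x / 10)
fit2 <- lm(y ~ x, data = df2)
plot(fit2, which = 1)  # the spread of residuals widens from left to right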

The second plot (Normal Q-Q) checks that the errors are roughly normally distributed, meaning most of the residuals lie close to the line and only a few fall far from it. This again has important implications for predictions, but if you are only interested in p-values it is less important, as we can invoke the central limit theorem in most cases. Essentially we want the points to lie on the fitted line; if they don't, it could be for any number of reasons. In your case it is likely due to the curvature your model has missed, but it could also be fixed by a log-transformation of your response. The plot should look something like this:

plot(fit, which = 2)
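For reference, you can build much the same plot by hand in base R; plot(fit, which = 2) uses the standardised residuals, which are available via rstandard():

#Manual Q-Q plot of the standardised residuals
qqnorm(rstandard(fit))  # sample quantiles against theoretical normal quantiles
qqline(rstandard(fit))  # reference line the points should hug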

Another way you can check this is with the normcheck function in the s20x package, which gives somewhat more intuitive output.

#install.packages("s20x")
library(s20x)
normcheck(fit) 

The third plot (Scale-Location) shows much the same information as the residuals vs. fitted plot but on a standardised scale: it plots the square root of the absolute standardised residuals against the fitted values. Both it and the residuals vs. fitted plot can be used to assess heteroscedasticity (variance changing with the fitted values). The plot should look something like this:

plot(fit, which = 3) 

This is also a better example of the kind of pattern we want to see in the first plot, as it has lost the odd edges.
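To make the "standardised scale" point concrete, you can reconstruct the scale-location plot by hand; it is just the square root of the absolute standardised residuals plotted against the fitted values:

#Manual version of the scale-location plot
plot(fitted(fit), sqrt(abs(rstandard(fit))),
     xlab = "Fitted values", ylab = "sqrt(|standardised residuals|)")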

The final plot (residuals vs. leverage) is a way of checking whether any points have undue influence over the line. Ordinary least squares, which is what linear models use, minimises the squared distances of the points from the line. Points at the extremes of the predictor range can have a disproportionate pull on the line: it's like sitting on the far end of a seesaw compared with sitting near the fulcrum. Typically the rule of thumb is that a point with a Cook's distance > 0.4 has a large influence over the line, and sometimes that might be grounds to remove it, but this should never be done lightly. Your plot shows at least two influential points, labelled 22 and 50, which correspond to rows 22 and 50 of your data. The labels let you easily subset those points and look at them to see if anything is wonky; sometimes it just comes down to an entry error. In your case I suspect it is because the model is inappropriate rather than the data having anomalies. The plot should look something like this:

plot(fit, which = 5)  # which = 5 is residuals vs. leverage (which = 4 gives Cook's distances instead)

You can see in that last one that there are no red boundary lines like in the top right corner of yours; that is because none of my points come close to having high leverage. This can also be assessed, in what I feel is a more intuitive way, using the cooks20x function from the s20x package:

 cooks20x(fit)

Just look for points that are above 0.4 on the y-axis. The number displayed above each column is the row of the observation in the data frame.
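However you spot them, the underlying Cook's distances are available directly from cooks.distance(), so you can pull out the flagged rows and eyeball them (using the 0.4 rule of thumb from above):

#Inspect the observations flagged as influential
influential <- which(cooks.distance(fit) > 0.4)
df[influential, ]  # check these rows for entry errors or genuine anomalies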

I hope this helps!