Residuals vs Fitted Plot – Interpreting a Residuals vs Fitted Plot and Extracting Points

rresiduals

I'm doing a multivariate linear regression with R, and i find myself with the following residuals vs fitted plot:

plot

As you can see there is a very regular line of points that seems to follow a precise pattern.

My questions are:

How do I interpret such a behavior, and what can I do to fix it?
Is there a way to isolate/extract those points? I'd like to take a look at them individually in my data set to see if by examining them I notice some patterns in the data.

Additional info:
My model is:

v.lm = lm(sqrt(v.stima$Y)~., data=v.stima)

Y is a count-variable (non-negative integer). I'm using sqrt because without it the plot has the typical "funnel" shape that indicates a non-homoscedastic error.

Best Answer

Well done for looking at the diagnostic plots for your regression. In this case, they have revealed that your model is inappropriate, as @Glen_b says in the comments. Sometimes you can get away with modelling count data with a gaussian "ordinary" regression. But in this case clearly the violations of the standard assumptions are too strong. There are too many actual values at zero where the model predicts negative values; and this is skewing the whole result and hence leaving a lot of structure in the residuals. You need to move to a Poisson distribution glm.

On the second part of your question, for future reference the identify() function is a good way to identify a few points in a plot eg

plot(predict(v.lm), residuals(v.lm))
identify(predict(v.lm), residuals(v.lm))

Another good trick, when you suspect something about those points, is to create a dummy variable for your candidate explanations (eg 1 when the response=0, 0 otherwise) and map that to a colour aesthetic. ggplot2 is a great package to use for this sort of thing.

Related Solutions

Solved – Use coefficients of thin plate regression splines in a clustering method

If I understand correctly, I think you want the coefficients from the $gam component:

> coef(test$gam)
(Intercept)     s(x1).1     s(x1).2     s(x1).3     s(x1).4     s(x1).5 
 21.8323526   9.2169405  15.7504889  -3.4709907  16.9314057 -19.4909343 
    s(x1).6     s(x1).7     s(x1).8     s(x1).9     s(x2).1     s(x2).2 
  1.1124505  -3.3807996  21.7637766 -23.5791595   3.2303904  -3.0366406 
    s(x2).3     s(x2).4     s(x2).5     s(x2).6     s(x2).7     s(x2).8 
 -2.0725621  -0.6642467   0.7347857   1.7232242  -0.5078875  -7.7776700 
    s(x2).9 
-12.0056347

Update 1: To get at the basis functions we can use predict(...., type = "lpmatrix") to get $Xp$ the smoothing matrix:

Xp <- predict(test$gam, type = "lpmatrix")

The fitted spline (e.g. for s(x1)) can be recovered then using:

plot(Xp[,2:10] %*% coef(test$gam)[2:10], type = "l")

You can plot this ($Xp$) and see that it is similar to um[[1]]$X

layout(matrix(1:2, ncol = 2))
matplot(um[[1]]$X, type = "l")
matplot(Xp[,1:10], type = "l")
layout(1)

I pondered why these are not exactly the same. ~~Is it because the original basis functions have been subject to the penalised regression during fitting???~~

Update 2: You can make them the same by including the identifiability constraints into your basis functions in um:

um2 <- smoothCon(s(x1), data=data.frame(x1=x1), knots=NULL, 
                 absorb.cons=TRUE)
layout(matrix(1:2, ncol = 2))
matplot(um2[[1]]$X, type = "l", main = "smoothCon()")
matplot(Xp[,2:10], type = "l", main = "Xp matrix") ##!##
layout(1)

Note I have not got the intercept in the line marked ##!##.

You ought to be able to get $Xp$ directly from function PredictMat(), which is documented on same page as smoothCon().

Solved – Weird fitted values/residuals plot

You can answer the question for yourself with simple mathematics. If observed $y \ge 0$ and $\hat y$ denotes fitted $y$, then residual $e = y - \hat y$ must be $\ge -\hat y$. The line $e = - \hat y$ is thus a lower limit on your residuals. Despite your unconventional axis choice, it is clear that your data follow suit.

The underlying problem is presumably use of a standard linear model on data not suited to such. One way forward is a log-linear or Poisson(-like) model: (fortuitously but fortunately for the OP as a Stata user) there is a Stata-rich explanation in this blog posting. The posting should be of considerable interest to many users of statistics, however.

P.S. A standard residual plot has residuals on the vertical axis and fitted or predicted on the horizontal axis. The choice of axes is not here an arbitrary convention. A horizontal line indicating zero residuals is the natural reference line, as indicating behaviour matching a perfect model. As emphasised often by J.W. Tukey and others, the best references are linear, and the best linear references are horizontal, in the sense of being easiest to think about. In Stata there is a built-in post-estimation rvfplot for use after regress.

P.P.S. The graph flags a Stata user. Naturally use of Stata is quite secondary here to the main question.

Best Answer

Related Solutions

Solved – Use coefficients of thin plate regression splines in a clustering method

Solved – Weird fitted values/residuals plot

Related Question