Residuals vs Fitted Plot – Interpreting a Residuals vs Fitted Plot and Extracting Points

r, residuals

I'm doing a multivariate linear regression in R, and I find myself with the following residuals vs fitted plot:

[Image: residuals vs fitted plot]

As you can see there is a very regular line of points that seems to follow a precise pattern.

My questions are:

  1. How do I interpret such behavior, and what can I do to fix it?
  2. Is there a way to isolate/extract those points? I'd like to take a look at them individually in my data set to see if by examining them I notice some patterns in the data.

Additional info:
My model is:

v.lm <- lm(sqrt(Y) ~ ., data = v.stima)

Y is a count variable (a non-negative integer). I'm using sqrt because without it the plot has the typical "funnel" shape that indicates heteroscedastic errors.

Best Answer

Well done for looking at the diagnostic plots for your regression. In this case they have revealed that your model is inappropriate, as @Glen_b says in the comments. Sometimes you can get away with modelling count data with an ordinary Gaussian regression, but here the violations of the standard assumptions are clearly too strong. There are too many actual values at zero where the model predicts negative values; this skews the whole fit and leaves a lot of structure in the residuals. The regular line of points is most likely those zeros: when the response is 0, the residual is exactly minus the fitted value, so these observations fall on a straight line with slope -1. You need to move to a GLM with a Poisson distribution.
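A minimal sketch of what that switch could look like. The data frame below is simulated as a stand-in for v.stima (the predictors x1 and x2 and the coefficients are assumptions for illustration, not the asker's data); the part that carries over is the glm() call.

```r
# Simulated stand-in for v.stima: a count response with many zeros.
# x1, x2 and the coefficients are made up for illustration.
set.seed(1)
n <- 500
v.stima <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
v.stima$Y <- rpois(n, lambda = exp(-0.5 + 0.8 * v.stima$x1 + 0.4 * v.stima$x2))

# Poisson GLM with the default log link; no sqrt() on the response.
v.glm <- glm(Y ~ ., family = poisson, data = v.stima)
summary(v.glm)

# Fitted means are on the count scale and are always positive.
range(fitted(v.glm))

# Rough overdispersion check: a ratio well above 1 suggests that
# family = quasipoisson or a negative binomial model may fit better.
sum(residuals(v.glm, type = "pearson")^2) / df.residual(v.glm)
```

Because the log link keeps the fitted mean positive, the model can no longer predict negative counts where the observed value is zero.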

On the second part of your question, for future reference: the identify() function lets you click on a few points in a plot to label them, and it returns their indices, e.g.:

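plot(predict(v.lm), residuals(v.lm))
# Click the points of interest; finish with Esc or a right-click (depends on the graphics device)
idx <- identify(predict(v.lm), residuals(v.lm))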
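identify() needs an interactive graphics device. If the suspicion is that the line consists of the observations with Y = 0, those points can also be pulled out programmatically, since for them the residual is exactly minus the fitted value. A sketch on simulated data (the data frame and the predictors x1 and x2 are assumptions standing in for the real v.stima):

```r
# Simulated stand-in for v.stima; x1, x2 and the coefficients are made up.
set.seed(1)
n <- 500
v.stima <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
v.stima$Y <- rpois(n, lambda = exp(-0.5 + 0.8 * v.stima$x1 + 0.4 * v.stima$x2))

v.lm <- lm(sqrt(Y) ~ ., data = v.stima)

fit <- fitted(v.lm)
res <- residuals(v.lm)

# Points on the line residual = -fitted, i.e. observed sqrt(Y) equal to 0.
on_line <- abs(res + fit) < 1e-8

# names(fit) holds the row names of the rows lm() actually used
# (rows with missing values are dropped by default).
line_rows <- v.stima[names(fit)[on_line], ]
head(line_rows)
table(line_rows$Y)  # only zeros if the suspicion is right
```

Indexing by names(fit) rather than by position keeps the extracted rows aligned with the original data even when lm() has dropped incomplete rows.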

Another good trick, when you suspect something about those points, is to create a dummy variable for each candidate explanation (e.g. 1 when the response is 0, 0 otherwise) and map it to a colour aesthetic. ggplot2 is a great package for this sort of thing.
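A sketch of that colouring trick, assuming ggplot2 is installed. The data are simulated as a stand-in for v.stima, and the names diag_df and is_zero are hypothetical, chosen only for this example.

```r
library(ggplot2)  # assumes ggplot2 is installed

# Simulated stand-in for v.stima; x1, x2 and the coefficients are made up.
set.seed(1)
n <- 500
v.stima <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
v.stima$Y <- rpois(n, lambda = exp(-0.5 + 0.8 * v.stima$x1 + 0.4 * v.stima$x2))

v.lm <- lm(sqrt(Y) ~ ., data = v.stima)

# One row per observation; there are no missing values here, so the
# fitted values line up with the rows of v.stima.
diag_df <- data.frame(
  fitted   = fitted(v.lm),
  residual = residuals(v.lm),
  is_zero  = factor(v.stima$Y == 0,
                    levels = c(FALSE, TRUE),
                    labels = c("Y > 0", "Y = 0"))
)

# Map the dummy variable to colour in the residuals vs fitted plot.
p <- ggplot(diag_df, aes(x = fitted, y = residual, colour = is_zero)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals", colour = "Candidate explanation")
print(p)
```

If the candidate explanation is right, all points of one colour sit on the regular line and the rest of the cloud is the other colour.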