Solved – Decreasing trend in residual vs fitted value plot

linear-model, multiple-regression, regression, residuals

Here is my residual vs. fitted value plot. It shows a decreasing trend. Can someone explain what could cause this to happen and how I can correct my model to produce a better fit? I am fitting something like `lm(y ~ x1 + x2)`.
[Image: residual vs. fitted value plot]

Best Answer

A hidden binary variable can explain this.

Your model is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

and it assumes that the error terms $\varepsilon$ have mean zero and are not associated with either $x_1$ or $x_2.$ Your plot indicates this assumption fails: the residuals decrease as the fitted values increase, and for fitted values greater than about $30$ they jump, apparently at random, by an amount of around $6.$

Suppose we re-express the errors $\varepsilon$ as the sum of another variable $z$ and a remainder $\delta,$ with $z = 6$ for about half of the observations where $y$ exceeds $30$ and $z = 0$ otherwise. If we can observe $z,$ the new model is

$$y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 z + \delta.$$

Fitting it would (a) account for the strip of points in the upper left of the plot and (b) allow the coefficients $\alpha_0, \alpha_1,$ and $\alpha_2$ to accommodate the consistent downward linear trend evident in the plot.

As a demonstration, here is code that generates random values according to the second model but fits them with both the first model and the second model. It displays a column of residual-vs-fitted plots (one for each model), repeating this three more times to give a sense of what is random and what is baked into the data-generating process. Qualitatively these plots do an excellent job of reproducing yours: the only noticeable feature missing from the simulation is the presence of three outlying values in your plot (one near $(33, 1)$ and the other two labeled "1074066" and "512803").

[Figure: four columns of residual-vs-fitted plots, one column per replication; Model 1 on top, Model 2 below]

Here is the code showing explicitly what is going on.

n <- 150    # Size of dataset
alpha <- 4  # Shape of x1 distribution
mu <- 20    # Mean of x1 distribution

set.seed(17)
par(mfcol=c(2,4))
replicate(4, {
  x1 <- rgamma(n, alpha, alpha / mu)  # Generate the x1 values
  x2 <- rexp(n)                       # Generate smaller x2 values
  y0 <- x1 + x2                       # Underlying response without z
  z <- ifelse(y0 > 30 & runif(n) <= 0.5, 6, 0)  # Hidden binary effect
  y <- y0 + z + runif(n)              # Generate the response with uniform error

  fit.1 <- lm(y ~ x1 + x2)
  # summary(fit.1)
  fit.2 <- lm(y ~ x1 + x2 + z)
  # summary(fit.2)

  plot(fit.1, which=1, main="Model 1")
  plot(fit.2, which=1, main="Model 2")
})
par(mfcol=c(1,1))

What you might do about this depends on whether you can observe $z.$ If you can, include it in the model. If you cannot, you might try to predict its values from other variables you do have. As a start, separate your residuals into two groups according to whether they lie along the upper or the lower line of points in the plot, then display the distributions of all other variables, one at a time, broken into these two groups. Any variable that exhibits a clear shift in location between the groups will help predict $z.$ You would need to include such a variable (call it $x_3$) in a nonlinear model of the form

$$y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3\mathcal{I}(x_3 \le \tau) + \delta$$

where $\mathcal{I}$ is the binary indicator (it creates the 0/1 variable $z$) and $\tau$ is a threshold to be estimated. Such a model can be fit in various ways; a quick way is to search over a reasonable range of $\tau,$ fitting a standard least-squares linear model for each value of $\tau$ you inspect.
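Both steps can be sketched as follows. This is only an illustration on simulated stand-ins: `y`, `x1`, `x2`, and the candidate predictor `x3` below are assumptions, not the asker's actual variables, and the cutoff separating the residual strips would be chosen by eye from the real plot.

```r
# Simulated data with a hidden binary effect driven by a third variable x3.
set.seed(17)
n  <- 150
x1 <- rgamma(n, 4, 4 / 20)
x2 <- rexp(n)
x3 <- rnorm(n)                               # hypothetical extra variable
y  <- x1 + x2 + 6 * (x3 <= -0.5) + runif(n)  # z = 6 when x3 <= -0.5

# Step 1: split the residuals of the original model into the upper and
# lower strips and look for a location shift in x3 between the groups.
r <- residuals(lm(y ~ x1 + x2))
group <- ifelse(r > 1, "upper", "lower")     # cutoff chosen by eye
tapply(x3, group, median)                    # upper strip: much smaller x3

# Step 2: profile the threshold tau by least squares, keeping the value
# that minimizes the residual sum of squares.
taus <- seq(quantile(x3, 0.05), quantile(x3, 0.95), length.out = 200)
rss <- sapply(taus, function(tau) {
  z <- as.numeric(x3 <= tau)
  sum(residuals(lm(y ~ x1 + x2 + z))^2)
})
tau.hat <- taus[which.min(rss)]
fit.tau <- lm(y ~ x1 + x2 + I(x3 <= tau.hat))
```

Because misclassifying even one observation changes the fit by roughly the jump size, the profiled $\hat\tau$ lands in the gap between the $x_3$ values on either side of the true threshold.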