Solved – What can be inferred from this residual plot

Tags: linear-model, multiple-regression, r, regression, validation

This is with reference to http://analyticspro.org/2016/03/05/r-tutorial-residual-analysis-for-regression/.

[Figure: the residual plots from the linked tutorial, including plot (b) referred to below]

For residual plot (b), where the residuals increase linearly with the predicted values, can we infer that we are missing a variable? What other interpretations are there?

Best Answer

The person who produced that plot made a mistake.

Here's why. The setting is ordinary least squares regression (including an intercept term), in which the responses $y_i$ are estimated as linear combinations of regressor variables $x_{ij}$ in the form

$$\hat y_i = \hat\beta_0 + \hat \beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_p x_{ip}.$$

By definition, the residuals are the differences

$$e_i = y_i - \hat y_i.$$
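(In R these are exactly the quantities returned by fitted() and resid() for a model fit with lm(). Here is a minimal sketch with made-up data and arbitrary names, just to fix ideas.)

set.seed(2)
x0 <- runif(30); y0 <- 2 + 3 * x0 + rnorm(30)        # Arbitrary simulated data
m  <- lm(y0 ~ x0)                                     # OLS fit with an intercept
all.equal(unname(resid(m)), y0 - unname(fitted(m)))   # TRUE: residuals are the responses minus the fitted values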

The plot of $(\hat y_i, e_i)$ in the question shows a strong, consistent linear relationship. In other words, there are numbers $\hat\alpha_0$ and $\hat\alpha_1$--which we can find by fitting a line to the points in that plot--for which the values

$$f_i = e_i - (\hat\alpha_0 + \hat\alpha_1 \hat y_i)$$

are much closer to $0$ than the $e_i$ (in the sense of having much smaller sums of squares). But this says nothing other than that the revised estimates

$$\eqalign{ \hat {y}_i^\prime &= \hat {y}_i + \hat\alpha_0 + \hat\alpha_1 \hat y_i \\ &= \left(\hat\alpha_0 + (1+\hat\alpha_1)\hat\beta_0\right) + (1+\hat\alpha_1)\hat\beta_1 x_{i1} + \cdots + (1+\hat\alpha_1)\hat\beta_p x_{ip}\tag{1} }$$

are better, in the least squares sense, than the original estimates, because their residuals are

$$y_i - \hat{y}_i^\prime = e_i - (\hat\alpha_0 + \hat\alpha_1 \hat y_i) = f_i.$$

But this is impossible: in $(1)$, $\hat y_i^\prime$ has been written explicitly as a linear combination of the original regressors (constant term included), so it is one of the very candidates among which least squares already minimized the sum of squared residuals. No such candidate can have a strictly smaller sum of squares than the least squares fit itself, so a genuine linear trend in the residual plot would mean the original fit was not actually the least squares solution.

This result is worth calling a theorem:

Theorem: When an ordinary least squares model includes an intercept term, the least squares slope of its residual-vs-predicted plot is always zero.
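You can check this numerically for any data set. Here is a minimal sketch in R with simulated data (the variable names are arbitrary):

set.seed(1)
u <- rnorm(100); v <- rnorm(100)
z <- 1 + 2 * u - 3 * v + rnorm(100)       # Any response will do
fit0 <- lm(z ~ u + v)                     # OLS with an intercept
coef(lm(resid(fit0) ~ fitted(fit0)))      # Slope (and intercept) are zero up to rounding error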


Residual plots like the one in the question can therefore arise only when a different kind of model is used. The two most common situations are (1) when the model includes no intercept and (2) when the model is not linear. The mechanism behind (1) becomes evident when you look at an example:

[Figure: left panel, the data with the no-intercept fit (red line, "Data and Fit"); right panel, the corresponding residual-vs-predicted plot showing a strong linear trend]

Because the model did not include an intercept, the fitted line must pass through $(0,0)$. Since the data points follow a strong linear trend that does not pass through $(0,0)$, the fit is necessarily poor: the best that can be done is to run the fitted line roughly through the middle of the point cloud. The trend in the residual plot is precisely the difference between the slope of the data points and the slope of the red line in the left panel.

In this case, contrary to what your reference states, a linear model is definitely valid. The only problem is that this fit failed to include an intercept term.

You may try this example out for yourself by varying the parameters in the R code that produced the figures.

set.seed(17)
x <- seq(15, 6, length.out=50)            # Specify the x-values
y <- -20 + 4 * x + rnorm(length(x), sd=2) # Generate y-values with error
fit <- lm(y ~ x - 1)                      # Fit a no-intercept model

par(mfrow=c(1,2))                         # Prepare for two plots
plot(x,y, xlim=c(0, max(x)), ylim=c(0, max(y)), pch=16, main="Data and Fit")
abline(fit, col="Red", lwd=2)             # Draw the no-intercept fitted line in red
plot(fit, which=1, pch=16, add.smooth=FALSE) # Residual-vs-predicted plot
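
As a further check of the argument above, you can fit a line to this residual-vs-predicted plot and apply the correction of equation $(1)$: it strictly reduces the sum of squared residuals, which the theorem says could never happen for a fit that included an intercept. Refitting with an intercept removes the trend entirely. (This is a sketch; alpha, y.prime, and fit2 are ad-hoc names, not part of the original code.)

alpha   <- coef(lm(resid(fit) ~ fitted(fit)))               # Line fitted to the residual-vs-predicted plot
y.prime <- fitted(fit) + alpha[1] + alpha[2] * fitted(fit)  # Revised estimates as in equation (1)
c(original = sum(resid(fit)^2), revised = sum((y - y.prime)^2)) # The revised sum of squares is smaller
fit2 <- lm(y ~ x)                                           # Same data, refit with an intercept
plot(fit2, which=1, pch=16, add.smooth=FALSE)               # The linear trend in the residuals is gone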