Regression – Trend in Residuals vs Dependent But Not in Residuals vs Fitted

regressionresiduals

I am fitting a linear model to a problem, and a little confused by what is going on. Without the details here are the two plots confusing me:

Residuals vs Fitted

Residuals vs Y

Now the residuals vs fitted looks good to me. Fairly evenly dispersed, no clear pattern. However, the $y$ vs fitted do not look good. I would have expected there to also not be a clear trend in this relationship.

This seems basic, but I'm rather lost. I don't want to fit a model that allows large positive errors at high values of $Y$ to make up for large negative errors at low values of $Y$.

Is this just an indication of a poorly fitting model? Or is something else going on here?

EDIT: Asked for fitted vs y (changed to have labels)

fitted vs y

EDIT2: Just want to point out that this question has already been asked and answered (albeit in a more abstract sense) What is the expected correlation between residual and the dependent variable? .

Best Answer

1) The residuals and the fitted are uncorrelated by construction. In fact if there was any correlation between them, there would be uncaptured linear trend in the data - we could get a closer fit by changing the coefficients until they were uncorrelated.

2) The residuals and the y-variable are always positively correlated. This is a necessary consequence of (1).

$$cov(e,y) = cov(e,e+\hat y) = \sigma^2+0 = \sigma^2$$

So it would be surprising if there wasn't a trend in that first plot.

Consider a simulated example -

Note that by plotting residuals against observed, it's equivalent to using slanted axes in the residuals vs fitted plot:

enter image description here

The reason for the high observed (the grey slanted lines mark constant observed, the ones to the far right are high) being associated with high residuals is clear here, as it the reason for them being only positive near the end.

Related Solutions

Solved – What resolution should I be using for residuals vs fitted values plot from a linear regression

Well correlation is a measure of linear association: given that for every value of x (the fitted values), the value of y (the residuals) is constant; the slope is 0. I.e., there's no correlation.

A significance test of the correlation between the fitted values and the residuals confirms this:

cor.test(fitted(lm_longitude), resid(longitude))

Pearson's product-moment correlation

data:  fitted and residual
t = 0, df = 13, p-value = 1
alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:
-0.5122628  0.5122628

sample estimates:
cor 
1.763983e-14

So you're pretty justified in using the second interpretation. The fact that a pattern seems to emerge on the order of 10^-4 is likely just noise. The scale you use to present the graph of the fitted values against the residuals is less important: use whatever provides a clear display of the data. Either way, there's still no correlation between the two.

Still worried there's a relation? Let's try a third degree polynomial regression, then where res is a vector of the residuals and ftd a vector of the fitted values:

Call:
lm(formula = res ~ ftd + I(ftd^2) + I(ftd^3))

Residuals:
Min         1Q     Median         3Q        Max 
-2.241e-04 -3.768e-05  8.900e-08  3.482e-05  1.738e-04 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0007511  0.0207611  -0.036    0.972
ftd         -0.0109064  0.0137376  -0.794    0.444
I(ftd^2)     0.0003520  0.0090690   0.039    0.970
I(ftd^3)     0.0047866  0.0060009   0.798    0.442

Residual standard error: 9.891e-05 on 11 degrees of freedom
Multiple R-squared:  0.3314,    Adjusted R-squared:  0.1491 
F-statistic: 1.818 on 3 and 11 DF,  p-value: 0.2022

Turns out, there's no significant relationship between these, even if we assume nonlinearity.

It doesn't matter on what scale you visualize the data: objectively, that changes nothing whatsoever. Regardless of how you look at it, either (1) there's really no relationship between the residuals and fitted values or (2) you don't have nearly enough data to conclusively demonstrate that there is.

Solved – Correlation between standardized residuals and fitted values in a linear mixed effect model: Course of action

OK based on this question/answer which states that

A strong correlation [between residuals and fitted values] is not necessarily cause for alarm. This may simply means the underlying process is noisy. However, a low R2 (and hence high correlation between error and dependent) may be due to model misspecification.

and because I can't see any misspecification in my formula I decided in favor of M3. Fortunately M1, M2 and M3 give the same conclusions in terms of which fixed effects are statistically significant.

It seems that the high correlation is simply due to the low prediction value of my factor variables. Not sure exactly why this wasn't evident in M0, M1 and M2 but then again the range of fitted values was also much lower.

Best Answer

Related Solutions

Solved – What resolution should I be using for residuals vs fitted values plot from a linear regression

Solved – Correlation between standardized residuals and fitted values in a linear mixed effect model: Course of action

Related Question