Solved – Residuals strongly positively correlated with response variable in linear regression

correlation, residuals

I ran a multiple linear regression on a dataset of 412 observations, with one response variable (Y) and 25 explanatory variables (X1–X25). Y and most of the Xs are not normally distributed, and several of the Xs are correlated with each other. The plots show that the residuals correlate strongly and positively with Y, and weakly and negatively with the fitted values. (Sorry, as I'm new to this site I'm not allowed to post images.)

To address these problems, I have tried principal component regression, weighted least squares regression, and ridge regression; none of them worked. I want to know what's wrong with the regression. Why do the residuals correlate so obviously with the observed Y?

Best Answer

1) Residuals do correlate positively with observed values in many, many cases. Think of it this way - a very large positive error ("error" is the "true residual", to misuse the language) means that the corresponding observation is, all other things equal, likely to be very large in a positive direction. A very large negative error means that the corresponding observation is likely to be very large in a negative direction. If the $R^2$ of the regression is not large, then the variability of the errors will be the dominating effect on the variability of the target variable, and you will see this effect in your plots and correlations.
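This intuition can be made exact. In OLS with an intercept, the residuals $\hat{e}$ are orthogonal to the fitted values $\hat{y}$, so writing $y = \hat{y} + \hat{e}$ gives

$$\operatorname{Var}(y) = \operatorname{Var}(\hat{y}) + \operatorname{Var}(\hat{e}), \qquad \operatorname{Cov}(y, \hat{e}) = \operatorname{Var}(\hat{e}),$$

and therefore

$$\operatorname{Corr}(y, \hat{e}) = \frac{\operatorname{Var}(\hat{e})}{\operatorname{sd}(y)\,\operatorname{sd}(\hat{e})} = \frac{\operatorname{sd}(\hat{e})}{\operatorname{sd}(y)} = \sqrt{1 - R^2}.$$

So a low $R^2$ guarantees a strong positive correlation between the residuals and the observed response; it is not a sign that anything is wrong.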

For example, consider the model $y_i = a + x_i + e_i$, which we'll fit as $y_i = a + bx_i + e_i$ (correct for $b = 1$). Here's the result of a regression with 100 observations:

e <- rnorm(100)        # true errors
x <- rnorm(100)
y <- 1 + x + e         # true model: a = 1, b = 1

foo <- lm(y ~ x)
plot(residuals(foo) ~ y, xlab = "y", ylab = "Residuals")

> summary(foo)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3292 -0.8280 -0.0448  0.8213  2.9450 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.8498     0.1288   6.600 2.12e-09 ***
x             0.8929     0.1316   6.787 8.81e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.286 on 98 degrees of freedom
Multiple R-squared: 0.3197, Adjusted R-squared: 0.3128 
F-statistic: 46.06 on 1 and 98 DF,  p-value: 8.813e-10 

[Plot: residuals against $y$ for the $b = 1$ model, showing a clear positive relationship]

Note that we achieved a fairly respectable (in some fields) $R^2$ of 0.32.

We can obscure this effect with a different model:

y <- 1 + 5*x + e

foo <- lm(y~x)
plot(residuals(foo)~y, xlab="y", ylab="Residuals")

which has an $R^2$ of 0.93 and the following residual plot:

[Plot: residuals against $y$ for the second model; the correlation is much less visually apparent]

Here the correlation between $y$ and the residuals is about 0.25, but it's a lot less obvious on the plot.
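You can check this correlation against the $\sqrt{1 - R^2}$ identity directly. A minimal sketch that regenerates the second model's data (the seed is an arbitrary assumption, so the exact numbers will differ slightly from the output above):

```r
set.seed(42)                     # assumed seed, for reproducibility
e <- rnorm(100)
x <- rnorm(100)
y <- 1 + 5 * x + e               # the second model above
foo <- lm(y ~ x)

r2 <- summary(foo)$r.squared
cor(y, residuals(foo))           # correlation of residuals with observed y
sqrt(1 - r2)                     # matches the correlation exactly
```

The two printed values agree to machine precision, because the identity holds exactly for any OLS fit with an intercept, not just asymptotically.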

2) Residuals have correlation zero with fitted values in a linear regression, by construction. Is your statement "... weakly correlated with fitted Y negatively" based solely upon looking at the plot, or did you actually calculate the correlation? If the former, appearances can be deceiving... if the latter, something is wrong; possibly you aren't looking at what you think you're looking at.
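That orthogonality is easy to verify numerically. A minimal sketch with simulated data (the seed and model are arbitrary assumptions):

```r
set.seed(1)                          # assumed seed, for reproducibility
x <- rnorm(100)
y <- 1 + x + rnorm(100)
fit <- lm(y ~ x)

cor(fitted(fit), residuals(fit))     # zero up to floating-point error
```

Any correlation you compute here that is not numerically zero means the "fitted values" being plotted are not actually the fitted values from the regression that produced those residuals.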
