Solved – When using linear regression analysis to get the fitted values of an outcome, why do the more extreme values tend to be predicted closer to the mean?

predictive-models, r, regression

I am working on a project in which I am using several independent variables to "predict" the values of an outcome using linear regression.

In R this is done quite simply as

model  <- lm(outcome ~ predictor1 + predictor2 + predictor3)
fitted <- model$fitted.values

I am interested in the difference between the actual values and the predicted values – i.e. how accurate the predictions are.

residuals <- model$residuals

My question relates to the relationship between residuals and outcome.

Samples with lower values of outcome tend to have negative values for residuals, and vice versa for samples with high outcome values.

Plotting the values against one another is the simplest way to see this:

[Image: various plots]

The $R^2$ for the original LM (outcome ~ predictors) is 0.42, the $R^2$ between residuals and outcome is 0.58, and the $R^2$ between fitted and outcome is 0.39.
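A minimal sketch of the kind of check I mean, with dat standing in for the real data frame (the name and columns are just placeholders):

model     <- lm(outcome ~ predictor1 + predictor2 + predictor3, data = dat)
fitted    <- model$fitted.values
residuals <- model$residuals

summary(model)$r.squared         # R^2 of the original model
cor(fitted, dat$outcome)^2       # R^2 between fitted values and outcome
cor(residuals, dat$outcome)^2    # R^2 between residuals and outcome

plot(dat$outcome, fitted);    abline(0, 1)     # fitted vs. actual
plot(dat$outcome, residuals); abline(h = 0)    # residuals vs. actual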

What could explain this phenomenon? Why would samples with high outcome values tend to be predicted lower than they actually are, and vice versa for samples with low outcome values? Or am I missing something conceptually here?

Many thanks for your input


Edited (13.08.20) to include updated plots and terminology (now using "residuals" rather than "difference") – but in essence the question remains the same. Thanks all for the input so far.

Best Answer

Basically, it's because the regression isn't perfect.

Suppose you had purely random data - no relation between the dependent and independent variables. Then the best prediction of the DV for every subject would be the mean of the DV.

Suppose you had a perfect relationship; then you would be able to predict the DV exactly.

In reality, it's always somewhere in between, so the predicted values end up between the mean and the actual values: high outcomes are predicted somewhat too low, low outcomes somewhat too high, which is exactly the pattern you see in your residuals.
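To make this concrete, here is a small simulation sketch (all names and numbers below are made up for illustration) showing the same pattern:

# Simulate an imperfect relationship, so R^2 is well below 1
set.seed(1)
n <- 200
x <- rnorm(n)
y <- x + rnorm(n, sd = 1.2)

fit <- lm(y ~ x)

sd(fitted(fit)); sd(y)        # fitted values are less spread out than the outcome
summary(fit)$r.squared        # model R^2
cor(resid(fit), y)^2          # equals 1 - R^2: residuals track the outcome

# With a purely random predictor, the fitted values collapse onto mean(y)
fit0 <- lm(y ~ rnorm(n))
sd(fitted(fit0))              # nearly 0: every prediction is close to the mean

The numbers in the question fit this pattern as well: the reported $R^2$ between residuals and outcome (0.58) is 1 minus the model $R^2$ (0.42), which is an algebraic consequence of least squares with an intercept.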