I am working on a project in which I am using several independent variables to "predict" the values of an outcome using linear regression.
In R this is done quite simply as
model <- lm(outcome ~ predictor1 + predictor2 + predictor3)
fitted <- model$fitted.values
I am interested in the difference between the predicted values and the actual values – i.e. how accurate the predictors are.
residuals <- model$residuals
My question relates to the relationship between `residuals` and `outcome`. Samples with lower values of `outcome` tend to have negative values for `residuals`, and vice versa for samples with high `outcome` values.
Plotting the values against one another is the simplest way to see this:
The $R^2$ for the original LM (`outcome ~ predictors`) is 0.42, the $R^2$ between `residuals` and `outcome` is 0.58, and the $R^2$ between `fitted` and `outcome` is 0.39.
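For reference, each of these $R^2$ values is just a squared Pearson correlation, which can be checked directly (assuming `model`, `fitted`, `residuals`, and `outcome` are defined as above):

```r
summary(model)$r.squared     # R^2 of the original LM
cor(residuals, outcome)^2    # R^2 between residuals and outcome
cor(fitted, outcome)^2       # R^2 between fitted and outcome
```

Note that for an OLS fit with an intercept, the squared correlation between `fitted` and `outcome` should equal the model's $R^2$ exactly, so the 0.39 vs 0.42 discrepancy may be worth double-checking.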
What could explain this phenomenon? Why would samples with high `outcome` tend to be predicted lower than they actually are, and vice versa for low values of `outcome`? Or indeed, am I missing something conceptually here?
Many thanks for your input
Edited (13.08.20) to include updated plots and terminology (now using "residuals" rather than "difference") – but in essence the question remains the same. Thanks all for the input so far.
Best Answer
Basically, it's because the regression isn't perfect.
Suppose you had purely random data - no relation between the dependent and independent variables. Then the best prediction of the DV for every subject would be the mean of the DV.
Suppose you had a perfect relationship; then you would be able to predict the DV exactly.
In reality, it's always somewhere in between: the predicted values are shrunk from the actual values toward the mean. So for samples above the mean, the fitted value falls short (positive residual), and for samples below the mean it overshoots (negative residual).
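This shrinkage, and the resulting positive correlation between residuals and outcome, shows up even in simulated data (a minimal sketch; the variable names and coefficients are arbitrary):

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- x + rnorm(n)            # imperfect relationship: true R^2 around 0.5
m <- lm(y ~ x)
cor(m$fitted.values, y)^2    # the model's R^2
cor(m$residuals, y)^2        # approximately 1 minus that R^2
```

Because `y = fitted + residuals` and OLS residuals are uncorrelated with the fitted values, the squared correlation between residuals and outcome is exactly $1 - R^2$ – which is consistent with the 0.42 and 0.58 reported in the question.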