Apparent correlation between standardized residuals and predicted values in regression

diagnostic, multiple regression

The following is the residuals vs. predicted scatter plot for a regression model with two IVs:

[scatter plot of standardized residuals against predicted values]

Initially, I thought it was evidence of heteroskedasticity. But I reasoned that, although there is a visible pattern in the plot, the variance across different levels of the predicted values is the same.

To clear my doubts, I saved the standardized residuals and predicted values and ran a bivariate correlation test (a sketch of such a check is shown below). The correlation is zero, as expected.
I am, nevertheless, intrigued by this observation. Any idea why I might have obtained this pattern?
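
In Stata, a check along these lines might look as follows. This is only a minimal sketch; the variable names (y, x1, x2, yhat, zres) are placeholders rather than the questioner's actual names.

regress y x1 x2
predict yhat, xb          // save the predicted (fitted) values
predict zres, rstandard   // save the standardized residuals
pwcorr zres yhat, sig     // bivariate correlation with its p-value; expect roughly zero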

PS: My dependent variable is the sum of two Likert-type items (response scale: 1-5). So its theoretical range is 2-10 and it has no absolute zero value.

Best Answer

Here is an example set-up in which the observed response takes just the integer values 2 to 10, as reported.

The Stata code should seem fairly transparent even to those who have never used it. gen means generate.

clear
set obs 500                          // 500 simulated observations
set seed 2803                        // for reproducibility

gen y = round(rnormal(6, 1.5), 1)    // discrete response taking a handful of integer values
gen x1 = rnormal()                   // predictor 1: standard normal noise
gen x2 = rnormal()                   // predictor 2: standard normal noise

regress y x1 x2
rvfplot, mla(y) mlabpos(0) ms(none)  // residual-vs-fitted plot, points labelled by y

[residual-versus-fitted plot from rvfplot, with each point labelled by its observed value of y]

I'm just regressing the response against Gaussian noise in this example, but the features noticeable on this plot are generic.

For each distinct observed response, there is a line

$\text{residual} = \text{observed} - \text{fitted}$

So, for observed $= 7$, all those points lie on the line

$\text{residual} = 7 - \text{fitted}$

and the slope against fitted is negative (here, where there is no standardization or other adjustment of the residuals, it is exactly $-1$).

That's always true. Naturally, at one extreme, if each value of the response is (literally) unique, each line is represented by just a single point and won't be discernible as such. But whenever there are only a few distinct values, as here, the lines will be discernible.
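
To see this in the simulated example, here is a small follow-up sketch, assuming the code above has just been run (the variable names fit and res are my additions). It checks that every point with observed response 7 sits exactly on the line $\text{residual} = 7 - \text{fitted}$.

predict fit, xb                                 // fitted values from the regression above
predict res, residuals                          // ordinary (unstandardized) residuals
assert abs(res - (7 - fit)) < 1e-8 if y == 7    // points with y == 7 satisfy residual = 7 - fitted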

Plotting the numeric values of the response is not standard, but it is surely a useful option for making clear what is happening. If your favourite software won't allow it, you need to change to a new favourite.

Incidentally, I prefer to see the actual values of the residuals and fitted values on these graphs.

Not the question, but with small discrete responses it's worth keeping track of whether the model is predicting impossible values. Plain linear regression may be a poor idea for such data.
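
For instance, continuing the sketch above (and taking 2 and 10 from the questioner's stated response scale), a quick check is:

count if fit < 2 | fit > 10    // how many fitted values fall outside the attainable 2-10 range
summarize fit                  // overall spread of the fitted values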
