Apparent correlation between standardized residuals and predicted values in regression

diagnostic, multiple regression

The following is the residuals vs. predicted scatter plot for a regression model with two IVs:

[scatter plot of standardized residuals against predicted values]

Initially, I thought it was evidence of heteroskedasticity. But I reasoned that, although there is a visible pattern in the plot, the variance across different levels of the predicted values is the same.

To clear my doubts, I saved the standardized residuals and predicted values and ran a bivariate correlation test (a sketch of such a check is shown below). The correlation is zero, as expected.
I am, nevertheless, intrigued by this observation. Any idea why I might have obtained this pattern?
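
In Stata, a check along these lines might look as follows. This is only a minimal sketch; the variable names (y, x1, x2, yhat, zres) are placeholders rather than the questioner's actual names.

regress y x1 x2
predict yhat, xb          // save the predicted (fitted) values
predict zres, rstandard   // save the standardized residuals
pwcorr zres yhat, sig     // bivariate correlation with its p-value; expect roughly zero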

PS: My dependent variable is the sum of two Likert-type items (response scale: 1-5). So its theoretical range is 2-10 and it has no absolute zero value.

Best Answer

Here is an example set-up in which the observed response takes just the integer values 2 to 10, as reported.

The Stata code should seem fairly transparent even to those who have never used it. gen means generate.

clear
set obs 500                          // 500 simulated observations
set seed 2803                        // for reproducibility

gen y = round(rnormal(6, 1.5), 1)    // discrete response taking a handful of integer values
gen x1 = rnormal()                   // predictor 1: standard normal noise
gen x2 = rnormal()                   // predictor 2: standard normal noise

regress y x1 x2
rvfplot, mla(y) mlabpos(0) ms(none)  // residual-vs-fitted plot, points labelled by y

[residual-versus-fitted plot from rvfplot, with each point labelled by its observed value of y]

I'm just regressing the response against Gaussian noise in this example, but the features noticeable on this plot are generic.

For each distinct observed response, there is a line

$\text{residual} = \text{observed} - \text{fitted}$

So, for observed $= 7$, all those points lie on the line

$\text{residual} = 7 - \text{fitted}$

and the slope against fitted is negative (here, where there is no standardization or other adjustment of the residuals, it is exactly $-1$).

That's always true. Naturally, at one extreme, if each value of the response is (literally) unique, each line is represented by just a single point and won't be discernible as such. But whenever there are only a few distinct values, as here, the lines will be discernible.
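
To see this in the simulated example, here is a small follow-up sketch, assuming the code above has just been run (the variable names fit and res are my additions). It checks that every point with observed response 7 sits exactly on the line $\text{residual} = 7 - \text{fitted}$.

predict fit, xb                                 // fitted values from the regression above
predict res, residuals                          // ordinary (unstandardized) residuals
assert abs(res - (7 - fit)) < 1e-8 if y == 7    // points with y == 7 satisfy residual = 7 - fitted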

Plotting the numeric values of the response is not standard, but it is surely a useful option for making clear what is happening. If your favourite software won't allow it, you need to change to a new favourite.

Incidentally, I prefer to see the actual values of the residuals and fitted values on these graphs.

Not the question, but with small discrete responses it's worth keeping track of whether the model is predicting impossible values. Plain linear regression may be a poor idea for such data.
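
For instance, continuing the sketch above (and taking 2 and 10 from the questioner's stated response scale), a quick check is:

count if fit < 2 | fit > 10    // how many fitted values fall outside the attainable 2-10 range
summarize fit                  // overall spread of the fitted values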
