R Residuals – Understanding Standardized Residuals in R’s lm Output

diagnosticlmrresidualsstandardization

If I plot the diagnostic plots to an R regression, a couple of them have "Standardized Residuals" as their y-axis such as in this plot:

enter image description here

What are the residuals standardized over? That is, let us assume that in my model, there are 100 predicted values; hence 100 residuals.

Standardized residual $e_i$ is defined as $(e_i – \bar e)/s_e$(realized residual – mean of all 100 realized residuals)/(standard deviation of all 100 realized residuals)?
Since each residual $e_i$ is itself a realized value out of a distribution of possible realizations for this single residual $e_i$, is this residual $e_i$ normalized by its own mean $\bar e_i$ and variance $\text{Var}(e_i)$ (as opposed to the mean and variance from all other values 1 to 100 as described above)?

I tried finding documentation clarifying this distinction but could not find any that was beyond doubt.

Best Answer

If you look at the code for plot.lm (by typing stats:::plot.lm), you see these snippets in there (the comments are mine; they're not in the original):

r <- residuals(x)                                # <---  r contains residuals

...

if (any(show[2L:6L])) {
    s <- if (inherits(x, "rlm")) 
        x$s
    else if (isGlm) 
        sqrt(summary(x)$dispersion)   
    else sqrt(deviance(x)/df.residual(x))        #<---- value of s
    hii <- lm.influence(x, do.coef = FALSE)$hat  #<---- value of hii

...

    r.w <- if (is.null(w)) 
        r                                        #<-- r.w  for unweighted regression
    else sqrt(w) * r
    rs <- dropInf(r.w/(s * sqrt(1 - hii)), hii)  # <-- std. residual in plots

So - if you don't use weights - the code clearly defines its standardized residuals to be the internally studentized residuals defined here:

http://en.wikipedia.org/wiki/Studentized_residual#How_to_studentize

which is to say:

$${\widehat{\varepsilon}_i\over \widehat{\sigma} \sqrt{1-h_{ii}\ }}$$

(where $\widehat{\sigma}^2={1 \over n-m}\sum_{j=1}^n \widehat{\varepsilon}_j^{\,2}$, and $m$ is the column dimension of $X$).

Related Solutions

Linear Regression in R – How to Interpret Model Diagnostics

This is a long and rambling question, so you are getting a long and rambling answer. Apologies. Using the example from the ?lm() call,

ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2,10,20, labels=c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
summary(lm.D9)
#output#
Call:
lm(formula = weight ~ group)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4938  0.0685  0.2462  1.3690 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.0320     0.2202  22.850 9.55e-15 ***
groupTrt     -0.3710     0.3114  -1.191    0.249    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared: 0.07308,    Adjusted R-squared: 0.02158 
F-statistic: 1.419 on 1 and 18 DF,  p-value: 0.249

I don't entirely understand your confusion on the "coefficients." The table simply presents the OLS estimate of $\beta$, standard error of the estimate $SE(\beta)$, the "distance" that $\beta$ is from 0 on the Normal$(0, SE(\beta))$ distribution, and the probability of observing a $\beta$ that far away from 0. Forgive me for the basic statistics review; I can't tell if this is what you are asking for.

Proper OLS-estimated regression modeling (which is what the lm command runs) requires several assumptions, and these diagnostic plots are designed to test them.

The "Residuals vs Fitted" and "Scale-Location" charts are essentially the same, and show if there is a trend to the residuals. OLS models require that the residuals be "identically and independently distributed," that their distribution does not change substantially for different values of $x$. None of your charts is really satisfactory on this regard. If this assumption is not met, your $\beta$ estimates will still be good, but your $t$-statistics, and corresponding $p$-values, are garbage.

Another assumption is that the errors are approximately normally distributed, which is what the Q-Q plot allows you to see. Again, none of your plots really satisfies me in this regard. The consequences of this assumption not being met are the same as above ($\beta$'s good, $t$'s worthless).

The "outliers" principle is actually not an assumption of OLS regression. But if you have outliers in certain locations, your $\beta$ parameters will be unduly influenced by them. In this case, both your $\beta$ and $t$ measurements are useless. You can remove an influential observation from a data frame by identifying its row number and issuing the command

data <- data[-offending.row,]

Where offending.row is the number of the row you want to eliminate. The R diagnostic plots label the row numbers of potential outliers.

I don't know what kind of data you have, but you should be very careful about eliminating observations that you don't like. You should instead ask yourself how that observation became this way. If it is due to measurement error, by all means discard it. If not, then is this observation a part of the system you are trying to model? If so, you should keep it in and adapt for it in other ways.

I have two suggestions for your analysis. First, try to use GLS estimators. This method assigns weights to your observations to correct for heteroskedasticity, outliers, and some degree of non-normality. The R command for this is gls().

But it seems from your plots that your data are restricted in some ways. In particular Test-P seems like a variable that is either 1 or 0, or restricted to that range. For such a variable, you may want to look at binary logit or probit models, available with the command

glm(model, family=binomial(link="logit"))

If your data is censored at 0 but not on the upper end, a tobit model is what you want, tobit() from the AER package looks like the right thing (I've never run a tobit model, I have just looked at it theoretically).

Finally, predictions are done with the predict() function. If you want to perturb your data afterwards (to create a distribution of possible predictions), the best way I know of it to add a random number to the prediction. Using the example above,

#base prediction
pred.values <- predict(lm.D9)
# get standard error of residuals
SER <- (summary(lm.D9)$sigma)^2
#perturbations
pert <- rnorm(length(pred.values), mean=0, sd=SER)
SIMULATION.VALUES <- pred.values + pert

You can get multiple alternate simulations by repeating the last two steps. Good luck.

Solved – Apparent correlation between standardized residuals and predicted in regression

Here is an example set-up with observed response just 2(1)10 as reported.

The Stata code should seem fairly transparent even to those who have never used it. gen means generate.

clear 
set obs 500 
set seed 2803 

gen y = round(rnormal(6, 1.5), 1) 
gen x1 = rnormal()
gen x2 = rnormal() 

regress y x1 x2 
rvfplot , mla(y) mlabpos(0) ms(none)

I'm just regressing the response against Gaussian noise in this example, but the features noticeable on this plot are generic.

For each distinct observed response, there is a line

residual $=$ observed $-$ fitted

So, for observed $= 7$, all those points lie on the line

residual $= 7 -$ fitted

and the slope with fitted is negative (here, where there is no standardization or other adjustment, it is exactly $-1$).

That's always true. Naturally at one extreme if each value of the response is (literally) unique, each line is represented by just a single point and won't be discernible as such. But whenever there are just a few distinct values, as here, the lines will be discernible.

Plotting the numeric values of the response is not standard but surely a useful option to make clear what is happening. If your favourite software won't allow it, you need to change to a new favourite.

Incidentally, I prefer to see the actual values of residual and fitted on these graphs.

Not the question, but with small discrete responses it's worth keeping tracking of whether the model is predicting impossible values. Plain linear regression may be a poor idea for such data.

Best Answer

Related Solutions

Linear Regression in R – How to Interpret Model Diagnostics

Solved – Apparent correlation between standardized residuals and predicted in regression

Related Question