Solved – does rstandard standardize in z

diagnostic, lm, r, residuals, standardization

I'm new to R, so please be gentle.

I was under the impression that rstandard(model) returns the z-scores of the residuals in model. However, when I standardized the residuals myself in z, the result was different. In fact, rstandard(model) had a mean different from 0 and had a standard deviation different from 1. The differences seem to be nonnegligible. What exactly does rstandard(model) do? Or am I doing something wrong here?

> x = rnorm(30, mean = 100, sd = 15)
> y = rnorm(30, mean = 100, sd = 15)
> model = lm(y ~ x)
> z.res = residuals(model)/sd(residuals(model)) #standardizing it myself
> rstandard(model) - z.res #difference between rstandard and what i did
            1             2             3             4             5 
-4.422354e-04  1.556269e-04 -4.576832e-03 -1.274350e-03  1.048068e-01 
            6             7             8             9            10 
-2.333922e-02  1.820134e-02 -3.307542e-03  3.368978e-02 -1.804108e-04 
           11            12            13            14            15 
-1.100621e-01 -1.343715e-03 -1.300427e-03  1.509862e-03  3.246602e-03 
           16            17            18            19            20 
 3.734255e-03 -1.821539e-06 -1.153190e-02 -1.713254e-06 -2.185101e-02 
           21            22            23            24            25 
-2.681935e-02  2.562472e-03 -4.721114e-02 -1.084481e-04 -3.430827e-03 
           26            27            28            29            30 
 4.149684e-04  7.705807e-04  2.166815e-03  2.537837e-02  4.182761e-04 
> mean(z.res) 
[1] -9.428041e-18 
#as expected of z-scores, mean is 0
> sd(z.res)
[1] 1 
#as expected of z-scores standard deviation is 1
> mean(rstandard(model)) 
[1] -0.001990908
#not really 0
> sd(rstandard(model)) 
[1] 1.019699
#not really 1

Also, from the way I understood the question "Standardized residuals in R's lm output", what rstandard returns are actually studentized residuals. But isn't there already rstudent for that?

I'm using R version 2.14.1 in Xubuntu 12.04.

Thank you.

Best Answer

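rstandard() produces standardised residuals: each residual is divided by an estimate of its own standard deviation, computed from the overall (full-data) estimate of the model's error variance. It does not compute z-scores of the residual vector.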

rstudent() produces Studentized residuals in the same way, but it uses a leave-one-out estimate of the error variance.

The key line in rstandard() is

res <- infl$wt.res/(sd * sqrt(1 - infl$hat))

where sd is defined as

sqrt(deviance(model)/df.residual(model))

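where model is the object returned by lm. Note that this is not the same as sd(resid(model)): sd() divides the residual sum of squares by $n - 1$, whereas the line above divides it by the residual degrees of freedom ($n - 2$ for your model).

Note also that sd is multiplied by $\sqrt{1 - h_{ii}}$, where the $h_{ii}$ are the hat values (leverages), so each residual gets its own divisor. Together, these two differences explain the discrepancy with your values.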
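As a rough check of the description above, the standardised residuals can be rebuilt by hand. This is a minimal sketch on simulated data like yours; the seed and the names s, h and manual are my own choices, not part of rstandard().

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- rnorm(30, mean = 100, sd = 15)
y <- rnorm(30, mean = 100, sd = 15)
model <- lm(y ~ x)

# Residual standard error: sqrt(RSS / residual degrees of freedom)
s <- sqrt(deviance(model) / df.residual(model))

# Leverages h_ii, the diagonal of the hat matrix
h <- hatvalues(model)

# Each residual is divided by s * sqrt(1 - h_ii)
manual <- residuals(model) / (s * sqrt(1 - h))

# Compare with the built-in function
all.equal(manual, rstandard(model))

# s differs from sd(residuals(model)) only through the divisor
# (n - 2 instead of n - 1)
n <- length(y)
all.equal(s, sd(residuals(model)) * sqrt((n - 1) / (n - 2)))
```

Because the leverages differ between observations, the divisor is not constant, so the result is not a simple rescaling of the residual vector and need not have mean 0 or standard deviation 1.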

The key line in rstudent() is

res <- res/(infl$sigma * sqrt(1 - infl$hat))

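which is almost the same, except that sd is replaced by infl$sigma, the leave-one-out estimate of the residual standard deviation (for each observation, the estimate obtained from a fit that omits that observation).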
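The same kind of check works for rstudent(). The sketch below assumes simulated data as in the question; the seed and the names infl, manual, d and fit_minus1 are illustrative only. It also refits the model without the first observation to show what the leave-one-out estimate means.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- rnorm(30, mean = 100, sd = 15)
y <- rnorm(30, mean = 100, sd = 15)
model <- lm(y ~ x)

# influence() returns the leverages ($hat), the leave-one-out
# residual standard deviations ($sigma) and the residuals ($wt.res)
infl <- influence(model)

# Hand-computed Studentized residuals
manual <- infl$wt.res / (infl$sigma * sqrt(1 - infl$hat))
all.equal(manual, rstudent(model))

# infl$sigma[1] should match the residual standard error of a fit
# that leaves out observation 1
d <- data.frame(x = x, y = y)
fit_minus1 <- lm(y ~ x, data = d[-1, ])
all.equal(unname(infl$sigma[1]), summary(fit_minus1)$sigma)
```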

Related Solutions

Solved – When conducting multiple regression, when should you center your predictor variables & when should you standardize them

In regression, it is often recommended to center the variables so that the predictors have mean $0$. This makes it easier to interpret the intercept term as the expected value of $Y_i$ when the predictor values are set to their means. Otherwise, the intercept is interpreted as the expected value of $Y_i$ when the predictors are set to 0, which may not be a realistic or interpretable situation (e.g. what if the predictors were height and weight?).

Another practical reason for scaling in regression is when one variable has a very large scale, e.g. if you were using the population size of a country as a predictor. In that case, the regression coefficients may be on a very small order of magnitude (e.g. $10^{-6}$), which can be a little annoying when you're reading computer output, so you may convert the variable to, for example, population size in millions. The convention of standardizing predictors exists primarily so that the regression coefficients are all expressed in the same units.

As @gung alludes to and @MånsT shows explicitly (+1 to both, btw), centering/scaling does not affect your statistical inference in regression models - the estimates are adjusted appropriately and the $p$-values will be the same.
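A small sketch of that invariance, using the built-in iris data (my choice of example; the names xs, raw and scaled are illustrative). Only the slope is compared, since centering changes what the intercept means.

```r
x  <- iris$Petal.Width
y  <- iris$Petal.Length
xs <- as.numeric(scale(x))  # centered and scaled copy of x

raw    <- summary(lm(y ~ x))$coefficients
scaled <- summary(lm(y ~ xs))$coefficients

# The slope estimates differ because the units of the predictor differ
raw["x", "Estimate"]
scaled["xs", "Estimate"]

# The t statistic and p-value of the slope are unchanged
all.equal(raw["x", "t value"],  scaled["xs", "t value"])
all.equal(raw["x", "Pr(>|t|)"], scaled["xs", "Pr(>|t|)"])
```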

Other situations where centering and/or scaling may be useful:

- When you're trying to sum or average variables that are on different scales, perhaps to create a composite score of some kind. Without scaling, it may be the case that one variable has a larger impact on the sum due purely to its scale, which may be undesirable.
- To simplify calculations and notation. For example, the sample covariance matrix of a matrix of values centered by their sample means is simply $X'X/(n-1)$. Similarly, if a univariate random variable $X$ has been mean centered, then ${\rm var}(X) = E(X^2)$ and the variance can be estimated from a sample by looking at the sample mean of the squares of the observed values.
- Related to the previous point, PCA can only be interpreted as the singular value decomposition of a data matrix when the columns have first been centered by their means.

Note that scaling is not necessary in the last two bullet points I mentioned and centering may not be necessary in the first bullet I mentioned, so the two do not need to go hand in hand at all times.
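The covariance and PCA points can be checked numerically. This is a sketch on the numeric columns of the built-in iris data (my choice of example; X, Xc, sv and pca are illustrative names). Note the factor $1/(n-1)$ that turns the centered cross-product into the sample covariance.

```r
X  <- as.matrix(iris[, 1:4])
Xc <- scale(X, center = TRUE, scale = FALSE)  # center columns, do not scale
n  <- nrow(Xc)

# Cross-product of the centered matrix, divided by n - 1,
# reproduces the sample covariance matrix
all.equal(crossprod(Xc) / (n - 1), cov(X), check.attributes = FALSE)

# The singular values of the centered matrix give the PCA standard
# deviations (prcomp centers by default and does not scale by default)
sv  <- svd(Xc)
pca <- prcomp(X)
all.equal(sv$d / sqrt(n - 1), pca$sdev)
```

Only centering is used here, which matches the remark that scaling is not needed for these two points.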

Linear Model – Effect of Standardization on Y-Intercept

Your model is:

$$y_j = \text{X}_{j} b + \epsilon_j = \sum_{i=0}^px_{ij}b_i + \epsilon_j$$

Let $b_0$ be the intercept, so every $x_{0j} = 1$

$$y_j = b_0 + \sum_{i=1}^px_{ij}b_i + \epsilon_j$$

Averaging over the $n$ observations gives:

$$\begin{aligned}
\bar y = \frac{1}{n}\sum_{j=1}^n y_j &= \frac{1}{n}\sum_{j=1}^n \left( b_0 + \sum_{i=1}^p x_{ij}b_i + \epsilon_j\right)\\
&= \frac{1}{n}\, n \cdot b_0 + \frac{1}{n}\sum_{j=1}^n \left(\sum_{i=1}^p x_{ij}b_i\right) + \frac{1}{n}\sum_{j=1}^n\epsilon_j\\
&= b_0 + \sum_{i=1}^p \left(\frac{1}{n}\sum_{j=1}^n x_{ij}\right)b_i
\end{aligned}$$

The last step uses the fact that the residuals of a least-squares fit with an intercept sum to zero.

As we made the average of each predictor column of $\mathbf X$ equal to $0$, we get:

$$\bar y = b_0$$

If you standardize $y$ as well, then:

$$\bar y = b_0 = 0$$

QED.

Note that the result depends only on the centering, not on the scaling of the variables.

This can easily be shown in R. Compare the three fits, and especially the intercept of fit2 with mean(y):

x = iris$Petal.Width
y = iris$Petal.Length
fit1 = lm(y ~ x)
fit2 = lm(y ~ I(scale(x)))
mean(y)
fit3 = lm(I(scale(y)) ~ I(scale(x)))
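To make the comparison explicit, here is a variant of the same fits that extracts the intercepts. It is a sketch only: the helper vectors xs and ys are my own, and as.numeric() is used so that scale()'s one-column matrix becomes a plain vector.

```r
x  <- iris$Petal.Width
y  <- iris$Petal.Length
xs <- as.numeric(scale(x))  # standardized predictor
ys <- as.numeric(scale(y))  # standardized response

fit1 <- lm(y ~ x)    # raw variables
fit2 <- lm(y ~ xs)   # standardized predictor only
fit3 <- lm(ys ~ xs)  # both standardized

# With a centered predictor the intercept equals the mean of the response
all.equal(coef(fit2)[["(Intercept)"]], mean(y))

# With the response standardized too, the intercept is zero up to rounding
abs(coef(fit3)[["(Intercept)"]]) < 1e-10

# The raw fit's intercept is the predicted y at x = 0, not mean(y)
coef(fit1)[["(Intercept)"]]
```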
