Solved – How exactly are standardized residuals calculated?

Tags: excel, regression, residuals

I'm working on a model and, for the moment, I prefer to work solely in Excel. I've been double-checking the results of the linear model in JMP, Minitab, and Statistica, and have (more or less) been getting the same answers.

One thing that's coming out odd, though, is my standardized residuals. I'm getting very different answers from Excel's regression routine, and I know it has to do with how I'm calculating them:

The standard deviation of our population varies relative to the output, so we work in terms of the relative standard deviation. We have an assumed %RSD of 5% (based on a lot of previous work; we also have reason to assume normality). From this I standardize the residuals as $\frac{x-u}{u\cdot \mathrm{RSD}}$, where $x$ is the observed value and $u$ is the predicted value, so $x-u$ is the residual.

Note that $u\cdot \mathrm{RSD} = s$, so this is just a z-score. The problem is that the values Excel gives me for the standardized residuals are very different from mine. That isn't exactly surprising, since I'm using a varying standard deviation, but Excel's values don't seem to be tied to the reality of the data: an observation can be off by as much as 50% (around 6 standard deviations by my reckoning) and the standardized residual Excel reports is only about 2 or 3.
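For concreteness, here's a rough sketch in Python of what I'm doing. The observed and predicted values below are made-up placeholders; only the 5% RSD is real:

```python
import numpy as np

# My standardization: z = (x - u) / (u * RSD), with s = u * RSD.
# The 5% RSD is our assumed relative standard deviation; the numbers
# below are placeholders, not my actual data.
RSD = 0.05

observed  = np.array([102.0,  95.0, 210.0, 48.0])   # x, observed values
predicted = np.array([100.0, 100.0, 200.0, 50.0])   # u, predicted values

residuals = observed - predicted        # x - u
z = residuals / (predicted * RSD)       # (x - u) / (u * RSD)
print(z)                                # -> [ 0.4, -1.0, 1.0, -0.8]
```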

Anyway, I'm having a really hard time finding out exactly how the residuals are standardized in a linear regression. Any help would be appreciated.

Best Answer

The statistical tools in Excel have always been black boxes. There's nothing for it but to do some forensic reverse-engineering. By performing a simple regression in Excel 2013 on the data $x=(1,2,3,4,5,6,7,8,9)$ and $y=(2,1,4,3,6,5,9,8,7)$, and requesting "standardized residuals" in the dialog, I obtained output that states:

  • The "Standard Error" is $1.3723\ldots$.

  • There are $9$ observations.

  • The residuals $r_i$ are $(0.5333\ldots, -1.35, \ldots, 0.35, -1.533\ldots)$.

  • The corresponding "Standard Residuals" are $(0.4154\ldots, -1.0516\ldots, \ldots, 0.2726\ldots, -1.1944\ldots)$.
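(As a cross-check, these quantities are easy to reproduce outside Excel. Here is a quick sketch in Python of the same ordinary least-squares fit with an intercept; only the fitting tool differs from what Excel does.)

```python
import numpy as np

# Reproduce the regression quantities quoted above (a Python sketch, not Excel):
# ordinary least squares of y on x with an intercept.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([2, 1, 4, 3, 6, 5, 9, 8, 7], dtype=float)
n = len(x)

slope, intercept = np.polyfit(x, y, 1)   # least-squares line
r = y - (intercept + slope * x)          # residuals r_i

print(np.round(r, 4))                                 # 0.5333, -1.35, ..., 0.35, -1.5333
print(np.round(np.sqrt(np.sum(r**2) / (n - 2)), 4))   # "Standard Error": 1.3723
```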

Since "standardized" values are typically numbers divided by some estimate of their standard error, I compared these "Standard Residuals" to the residuals and to the "Standard Error." Knowing that various formulas for variances are sums of squares of residuals $r_i$ divided variously by $n$ (the number of data) or $n-p$ (the number of data reduced by the number of variables, in this case two: one for the intercept and a second for the slope), I squared everything in sight. It became immediately obvious that Excel is computing the "Standard Residual" as

$$\frac{r_i}{\sqrt{\frac{1}{n-1}\sum_{i=1}^n r_i^2}}.$$

This formula reproduced Excel's output exactly: not even a trace of floating-point roundoff error.

The denominator is exactly what Excel's STDEV function would return for the residuals (their mean is zero in a regression with an intercept). For residuals from a mean, its square is an unbiased estimate of their variance. For residuals in a regression, however, it has no standard meaning or value. It's garbage! But now you know how to compute it...
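If you want to check this outside Excel, here is a quick sketch in Python: the fit is an ordinary least-squares line with an intercept, and the last lines apply the formula above to the residuals.

```python
import numpy as np

# Excel's "Standard Residual": r_i / sqrt( sum(r_i^2) / (n - 1) ).
# Because OLS residuals from a model with an intercept sum to zero,
# the denominator equals STDEV(residuals).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([2, 1, 4, 3, 6, 5, 9, 8, 7], dtype=float)
n = len(x)

slope, intercept = np.polyfit(x, y, 1)
r = y - (intercept + slope * x)

denom = np.sqrt(np.sum(r**2) / (n - 1))   # same as np.std(r, ddof=1) here
standard_residuals = r / denom
print(standard_residuals)                 # matches the "Standard Residuals" quoted above
```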