Solved – How to compute the standard deviation of residuals from a regression line or curve

regression, residuals, standard deviation

After fitting a line or curve, it is easy to compute each residual as the difference between the actual Y value and the Y value predicted by the fitted model. For standard (least-squares) regression, the goal is to minimize the sum of the squares of these residuals.

What is the standard deviation of the residuals? I've always summed the squares of all the residuals, divided by (N – K), where N is the number of points and K is the number of parameters fit by regression, and then taken the square root of that quotient. This is called Sy.x or Se.
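As a quick sketch of that calculation (using made-up data and a straight-line fit, so K = 2 for slope and intercept):

```python
import numpy as np

# Hypothetical data points (not from the question)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Least-squares straight-line fit: K = 2 parameters (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

N = len(y)  # number of points
K = 2       # number of parameters fit by regression
sy_x = np.sqrt(np.sum(residuals**2) / (N - K))  # Sy.x (a.k.a. Se)
```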

In a few places I've seen a value called the Root Mean Square Error (RMSE) computed with N – 1 rather than N – K in the denominator. Except for the special case where only one parameter is fit (K = 1), this RMSE will differ from Sy.x.

Is there any justification for computing RMSE with a denominator of N-1, or should it always be computed with N-K?

Best Answer

For the variance, the $n-k$ divisor is unbiased (its square root is not unbiased for $\sigma$, though).

The MLE for $\sigma$ (under the usual regression assumptions) would use a divisor of $n$.

The minimum-MSE estimator (if one exists) would use a different divisor again (and if it does exist, I really don't think it will give $n-1$ in general).

I can't think of any common choice of estimator that would result in an $n-1$ divisor, but I don't see any particular reason to dismiss it -- it lies between the usual (unbiased-for-variance) estimate and the ML estimate at the normal. Those are both choices with some nice properties, and all three are consistent estimators of $\sigma$; they just arrive at different compromises when trading off desirable properties.
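A small numerical comparison of the three divisors may make the relationship concrete. This is a sketch on synthetic data (the variable names and the simulated line are my own, not from the question); it shows that the three estimates are ordered $n$-divisor < $(n-1)$-divisor < $(n-k)$-divisor and converge as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y = 3 + 2x + noise and fit a straight line (K = 2 parameters)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=x.size)

slope, intercept = np.polyfit(x, y, 1)
ss_res = np.sum((y - (slope * x + intercept)) ** 2)

N, K = len(y), 2
sy_x  = np.sqrt(ss_res / (N - K))  # the usual Sy.x / Se (unbiased-for-variance divisor)
s_nm1 = np.sqrt(ss_res / (N - 1))  # the N-1 variant asked about
s_mle = np.sqrt(ss_res / N)        # maximum-likelihood estimate at the normal
```

Because the sum of squared residuals is fixed and only the divisor changes, `s_mle < s_nm1 < sy_x` always holds (for K > 1), and the gaps shrink as N increases.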