How is the root mean square error related to the standard deviation of a sample?

confidence-interval, mathematical-statistics, standard-deviation

If I am given a few observations of two variables (say, servicetime and desktops), I can use the following to find a confidence interval for the mean: $CI = \bar{y} \pm t^{(n-1)}_{(\alpha/2)} \cdot \frac{s}{\sqrt{n}}$, where $\bar{y}$ is the sample mean, $n$ is the number of observations, and $s$ is the sample standard deviation.
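
For concreteness, here is a minimal sketch of that calculation in Python; the servicetime numbers are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical service-time observations (values invented for illustration)
servicetime = np.array([35, 58, 92, 44, 70, 65, 81, 50, 62, 75], dtype=float)

n = len(servicetime)
y_bar = servicetime.mean()
s = servicetime.std(ddof=1)            # sample standard deviation
t_crit = stats.t.ppf(0.975, df=n - 1)  # t critical value for alpha = 0.05

half_width = t_crit * s / np.sqrt(n)
print(f"95% CI for the mean: [{y_bar - half_width:.2f}, {y_bar + half_width:.2f}]")
```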

Now let's say we still have the same data, but we have an extra observation where one of the variables is known and the other is to be estimated from the model, and we want to find the confidence interval for the mean servicetime given desktops $=9$. Here the regression would look like this: $\text{servicetime}=\beta_0 + \beta_1 \,\text{desktops}$. To find the confidence interval we have to use $CI=\hat{y} \pm t^{(n-2)}_{(\alpha/2)} \cdot s \cdot \sqrt{DV}$, where $DV$ is the distance value and this time $s$ is the root mean square error.
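
Here is a sketch of that interval for the mean servicetime at desktops $=9$, using the usual simple-regression form of the distance value, $DV = \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}$; the data are again invented:

```python
import numpy as np
from scipy import stats

# Hypothetical paired data (values invented for illustration)
desktops    = np.array([1, 2, 3, 4, 4, 5, 6, 6, 7, 8], dtype=float)
servicetime = np.array([37, 58, 69, 75, 82, 93, 104, 110, 127, 134], dtype=float)

n = len(desktops)
x_bar, y_bar = desktops.mean(), servicetime.mean()

# Least-squares slope and intercept
Sxx = np.sum((desktops - x_bar) ** 2)
b1 = np.sum((desktops - x_bar) * (servicetime - y_bar)) / Sxx
b0 = y_bar - b1 * x_bar

# Root mean square error: residual sum of squares over n - 2 degrees of freedom
residuals = servicetime - (b0 + b1 * desktops)
rmse = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Distance value and CI for the mean response at desktops = 9
x0 = 9.0
dv = 1 / n + (x0 - x_bar) ** 2 / Sxx
y_hat = b0 + b1 * x0
t_crit = stats.t.ppf(0.975, df=n - 2)
half_width = t_crit * rmse * np.sqrt(dv)
print(f"95% CI for mean servicetime at desktops=9: "
      f"[{y_hat - half_width:.1f}, {y_hat + half_width:.1f}]")
```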

My question is: why do we use the root mean square error in this situation instead of the standard deviation? What is the relationship between the standard deviation and the root mean square error?

Best Answer

(Note that there are many assumptions required to use the confidence interval formulas you quote. For simplicity I will ignore these here.)

The answer is the missing term in your regression equation, which should really be written $$\text{servicetime}=\beta_0 + \beta_1 \text{desktops} + \epsilon$$ where the error term $\epsilon$ is typically assumed to be unbiased, i.e. $\mathbb{E}[\epsilon]=0$.

In the standard OLS model, the errors $\epsilon_i=\text{servicetime}_i-(\beta_0 + \beta_1 \text{desktops}_i)$ are assumed to be i.i.d. with mean zero and common variance $\sigma^2$. The root mean square error is the estimate of $\sigma$ computed from the residuals; it plays exactly the role that $s$ plays in your first formula, where $s$ estimates the standard deviation of the observations around their mean. The divisor (and the $t$ degrees of freedom) is $n-2$ rather than $n-1$ because two coefficients are estimated instead of one.

So the short answer is: $\text{RMS error} = \text{standard deviation of the residuals}$.
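
A quick numerical check of that equality (the data are invented; with an intercept in the model the residuals average to zero, so their root mean square coincides with their standard deviation once the same $n-2$ divisor is used):

```python
import numpy as np

# Hypothetical paired data, as in the question's setup (values invented)
desktops    = np.array([1, 2, 3, 4, 4, 5, 6, 6, 7, 8], dtype=float)
servicetime = np.array([37, 58, 69, 75, 82, 93, 104, 110, 127, 134], dtype=float)

n = len(desktops)
x_bar, y_bar = desktops.mean(), servicetime.mean()
Sxx = np.sum((desktops - x_bar) ** 2)
b1 = np.sum((desktops - x_bar) * (servicetime - y_bar)) / Sxx
b0 = y_bar - b1 * x_bar
residuals = servicetime - (b0 + b1 * desktops)

# RMSE as used in the regression CI formula (divisor n - 2) ...
rmse = np.sqrt(np.sum(residuals ** 2) / (n - 2))
# ... equals the standard deviation of the residuals with the same divisor,
# because the residuals have mean (essentially) zero.
sd_resid = residuals.std(ddof=2)

print(rmse, sd_resid)   # the two numbers agree
```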
