R Confidence Intervals – How to Calculate Confidence Intervals for Variance of Residuals

confidence intervalrregressionresidualsvariance

I have three variables:

  • Number of house sales
  • Month (in couples)
  • Region of a city (N-W-E-S)

and I want to calculate confidence intervals for the residual of the errors. So, given the data:

month <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", 
     "3", "4", "4", "4", "4", "5", "5", "5", "5", "6", "6", "6", 
     "6")
region <-c("1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", 
      "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", 
      "4")
sales <-c(85, 107, 61, 22, 40, 65, 58, 51, 60, 41, 45, 27, 15, 
          30, 68, 63, 28, 3, 57, 12, 36, 21, 10, 16)

data <- cbind(sales, month, region)
data <- as.data.frame(data)

salesmod <- lm(sales ~ month + region, data=data)
summary(salesmod)
anova(salesmod)

We can check the degrees of freedom and the sum of the residuals from the summary(salesmod) and anova(salesmod).


Call:
lm(formula = sales ~ month + region, data = data)

Residual standard error: 22.88 on 15 degrees of freedom
Multiple R-squared:  0.4855,    Adjusted R-squared:  0.2111 
F-statistic: 1.769 on 8 and 15 DF,  p-value: 0.1623

> anova(salesmod)
Analysis of Variance Table

Response: sales
          Df Sum Sq Mean Sq F value Pr(>F)  
month      5 6368.7 1273.74  2.4325 0.0835 .
region     3 1042.8  347.60  0.6638 0.5871  
Residuals 15 7854.5  523.63                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So the variance of the residuals would be $523.63/15=34.90867$, but how do I compute a confidence intervals for this value (of given 95% confidence).

Best Answer

You seem to have made a small mistake. The variance of residuals is $7854.5/15=523.63$ (you have divided twice).

Now, what you are looking for is distribution of the estimate of the variance of true errors ($\varepsilon$) so that you can construct a confidence interval for it.

First let $\boldsymbol{\varepsilon} \sim N(\mathbf{0},\sigma^2I)$. Now the estimate for $\sigma^2$ is (where $K$ is the number of paramters in the model):

\begin{align} s^2 \equiv \frac{\mathbf{e'e}}{n-K} = \frac{\boldsymbol{\varepsilon'}M\boldsymbol{\varepsilon}}{n-K} \end{align}

So for this you can use the following property that (taken directly from this wiki page)

If $Y$ is a vector of $k$ i.i.d. standard normal random variables and $A$ is a $k\times k$ symmetric, idempotent matrix with rank $k-n$, then the quadratic form $Y^{T}AY$ is chi-square distributed with $k-n$ degrees of freedom.

Since $M$ is idempotent and symmetric with rank $n-K$, it is clear that:

$$\Big(\frac{\boldsymbol{\varepsilon'}}{\sigma}\Big)M\Big(\frac{\boldsymbol{\varepsilon}}{\sigma}\Big) = \frac{n-K}{\sigma^2}s^2 \sim \chi^2_{(n-K)}$$

This can be used now to construct the confidence interval:

$$Pr\bigg(q_{\alpha/2}<\frac{n-K}{\sigma^2}s^2 < q_{1-\alpha/2}\bigg)=1-\alpha$$

$$Pr\bigg(\frac{n-K}{q_{1-\alpha/2}}s^2 < \sigma^2 < \frac{n-K}{q_{\alpha/2}}s^2 \bigg) = 1-\alpha$$

where $q_p$ is the $p^{th}$ quantile of $\chi^2_{(n-K)}$

Based on your data, we can use qchisq in R to get these quantilies.

# lower bound:
sum(salesmod$residuals^2)/qchisq(0.975,df=15)
[1] 285.7373

# upper_bound
sum(salesmod$residuals^2)/qchisq(0.025,df=15)
[1] 1254.277