Regression – Calculating Combined Confidence Intervals for Categorical (Dummy) Interactions

confidence-interval, interaction, regression

I have an OLS regression with interacted categorical variables. One of these variables encodes $n$ cross-sectional units, the other indicates time along $m$ years. I am interested in estimating unit-time specific effects (conditional on other variables) to get a sense of relative levels and changes over time for each unit.

The model I am fitting has the following form:

$Y = \alpha + \sum\mathbf{\beta_{n-1}X} + \sum\mathbf{\gamma_{m-1}T} + \sum\mathbf{\delta_{(n-1)(m-1)}XT} + … + \epsilon$,

where $\mathbf{X}$ is a matrix of $n-1$ dummy variables that capture the different units (relative to a natural baseline category), and $\mathbf{T}$ is a matrix of $m-1$ dummy variables for the years (relative to the first year). The data actually has more than two dimensions, so I am not estimating a single fixed effect for each observation but I ignore this here for readability.

The sum of the coefficient estimates $\beta_2 + \gamma_3 + \delta_6$ (for instance) tells me how unit 2 in year 3 differs from the baseline unit in the baseline year, and so forth.

What I would like to know is: How can I calculate the combined confidence intervals for the sum of these three (and other triples of) coefficients?

There are two related posts here and here, which address how to combine confidence intervals for two coefficient estimates from an interaction. Based on these posts, I have tried to adapt the procedure to the case of three coefficient estimates, but I am uncertain whether my approach is correct.

Here is what I did, illustrated on some toy data:

> set.seed(1)
> unit <- sample(0:2, 100, replace = T)
> yr <- sample(0:3, 100, replace = T)
> 
> # for simplicity, create Y based on a linear DGP
> Y <- 1 + .5*unit + 1.5*yr - .9*unit*yr + rnorm(100, sd = 4)
> 
> summary(reg <- lm(Y ~ factor(unit) * factor(yr)))

Call:
lm(formula = Y ~ factor(unit) * factor(yr))

Residuals:
   Min     1Q Median     3Q    Max 
-6.093 -2.373 -0.201  2.433  8.717 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)   
(Intercept)                  1.374      1.135    1.21   0.2291   
factor(unit)1               -2.694      1.702   -1.58   0.1171   
factor(unit)2                3.787      2.123    1.78   0.0779 . 
factor(yr)1                 -0.429      1.768   -0.24   0.8088   
factor(yr)2                  0.534      1.605    0.33   0.7399   
factor(yr)3                  4.789      1.853    2.58   0.0114 * 
factor(unit)1:factor(yr)1    4.714      2.454    1.92   0.0580 . 
factor(unit)2:factor(yr)1   -1.394      2.789   -0.50   0.6183   
factor(unit)1:factor(yr)2    3.314      2.293    1.45   0.1519   
factor(unit)2:factor(yr)2   -6.800      2.763   -2.46   0.0158 * 
factor(unit)1:factor(yr)3    3.121      2.623    1.19   0.2374   
factor(unit)2:factor(yr)3   -8.864      2.818   -3.15   0.0023 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.59 on 88 degrees of freedom
Multiple R-squared:  0.302, Adjusted R-squared:  0.214 
F-statistic: 3.45 on 11 and 88 DF,  p-value: 0.0005

Thus, for unit 2 in year 3 I get:

> coef(reg)[c(3,6,12)]
            factor(unit)2               factor(yr)3 factor(unit)2:factor(yr)3 
                  3.78706                   4.78857                  -8.86370 

So, $\beta_2 + \gamma_3 + \delta_6$ equals:

> sum(coef(reg)[c(3,6,12)])
[1] -0.288069

My attempt to calculate combined confidence intervals for the three coefficients looks like this:

# get the residual degrees of freedom (df.residual() returns them directly;
# nrow(mf) - ncol(mf) counts model-frame columns, not estimated
# coefficients, and would overstate the df here)
DoF <- df.residual(reg)
ca_1.96 <- qt(0.975, DoF)

# get the relevant part of the variance-covariance matrix
sigma2mat  <- vcov(reg)[c(3,6,12), c(3,6,12)]
sigma2mat 
                          factor(unit)2 factor(yr)3 factor(unit)2:factor(yr)3
factor(unit)2                   4.50659     1.28760                  -4.50659
factor(yr)3                     1.28760     3.43359                  -3.43359
factor(unit)2:factor(yr)3      -4.50659    -3.43359                   7.94018

# calculate the standard error of the sum
SE <- sqrt(sum(sigma2mat))
SE
[1] 1.60474
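Summing all nine entries of this $3 \times 3$ block implements the usual variance rule for a sum of (correlated) estimates:

$$\operatorname{Var}(\hat\beta_2 + \hat\gamma_3 + \hat\delta_6) = \operatorname{Var}(\hat\beta_2) + \operatorname{Var}(\hat\gamma_3) + \operatorname{Var}(\hat\delta_6) + 2\operatorname{Cov}(\hat\beta_2, \hat\gamma_3) + 2\operatorname{Cov}(\hat\beta_2, \hat\delta_6) + 2\operatorname{Cov}(\hat\gamma_3, \hat\delta_6).$$

Checking against the matrix above: $15.88036 + 2 \cdot (1.28760 - 4.50659 - 3.43359) = 2.57520$, and $\sqrt{2.57520} \approx 1.60474$, matching the SE computed above.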

# calculate confidence intervals
CIs <- sum(coef(reg)[c(3,6,12)]) + c(-ca_1.96, ca_1.96) * SE
CIs
[1] -3.47303  2.89690
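An equivalent and more general route is to treat the sum as a linear contrast $c^\top \hat\beta$ with variance $c^\top V c$, where $c$ is an indicator vector selecting the coefficients of interest. A minimal sketch on the same toy data (index positions 3, 6, 12 as above):

```r
set.seed(1)
unit <- sample(0:2, 100, replace = TRUE)
yr   <- sample(0:3, 100, replace = TRUE)
Y    <- 1 + .5*unit + 1.5*yr - .9*unit*yr + rnorm(100, sd = 4)
reg  <- lm(Y ~ factor(unit) * factor(yr))

# indicator vector selecting factor(unit)2, factor(yr)3 and their interaction
cvec <- rep(0, length(coef(reg)))
cvec[c(3, 6, 12)] <- 1

est <- sum(cvec * coef(reg))                       # point estimate of the sum
se  <- drop(sqrt(t(cvec) %*% vcov(reg) %*% cvec))  # same value as sqrt(sum(sigma2mat)) above
est + c(-1, 1) * qt(0.975, df.residual(reg)) * se  # combined CI
```

This generalizes directly to any other (unit, year) pair by changing which positions of `cvec` are set to 1.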

Edit:

Adding the intercept:

sum(coef(reg)[c(1,3,6,12)])
[1] 1.08606

CIs <- sum(coef(reg)[c(1,3,6,12)]) + c(-ca_1.96, ca_1.96) * SE
CIs
[1] -1.16605  3.33817

I would be grateful for any pointers as to whether this is the right way of going about this – and if not – how it could be done. Thank you!

Best Answer

You've left out the intercept and its part of the variance-covariance matrix from your prediction; otherwise this looks fine, but you've probably worked harder than necessary. The predict() function can do the hard part for you:

set.seed(1)
unit <- sample(0:2, 100, replace = T)
yr <- sample(0:3, 100, replace = T)

# for simplicity, create Y based on a linear DGP
Y <- 1 + .5*unit + 1.5*yr - .9*unit*yr + rnorm(100, sd = 4)

reg <- lm(Y ~ factor(unit) * factor(yr))

# now create a row of data to be used for prediction:
pred_data <- data.frame(unit = factor(2, levels = 0:2), yr = factor(3, levels = 0:3))

# if you want a confidence interval (on the fitted mean line):
predict(reg, pred_data, level=0.95, interval="confidence")

# if you want a prediction interval (adds the estimated residual variance)
predict(reg, pred_data, level=0.95, interval="prediction")

Results are:

> predict(reg, pred_data, level=0.95, interval="confidence")
       fit       lwr      upr
1 1.086058 -1.168965 3.341081
> predict(reg, pred_data, level=0.95, interval="prediction")
       fit       lwr      upr
1 1.086058 -6.393008 8.565124
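To see why the intercept matters, the confidence interval from predict() can also be reproduced by hand: include the intercept's row and column when summing the covariance submatrix. A sketch, repeating the setup so it runs standalone:

```r
set.seed(1)
unit <- sample(0:2, 100, replace = TRUE)
yr   <- sample(0:3, 100, replace = TRUE)
Y    <- 1 + .5*unit + 1.5*yr - .9*unit*yr + rnorm(100, sd = 4)
reg  <- lm(Y ~ factor(unit) * factor(yr))

# intercept, factor(unit)2, factor(yr)3, and their interaction
idx <- c(1, 3, 6, 12)
est <- sum(coef(reg)[idx])
se  <- sqrt(sum(vcov(reg)[idx, idx]))  # now includes the intercept's variance/covariances
est + c(-1, 1) * qt(0.975, df.residual(reg)) * se
```

This matches the interval="confidence" output above.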