I have an OLS regression with interacted categorical variables. One of these variables encodes $n$ cross-sectional units, the other indicates time along $m$ years. I am interested in estimating unit-time specific effects (conditional on other variables) to get a sense of relative levels and changes over time for each unit.
The model I am fitting has the following form:
$Y = \alpha + \sum_{i=1}^{n-1}\beta_i X_i + \sum_{j=1}^{m-1}\gamma_j T_j + \sum_{k=1}^{(n-1)(m-1)}\delta_k (XT)_k + \dots + \epsilon$,
where $X_1, \dots, X_{n-1}$ are dummy variables that capture the different units (relative to a natural baseline category), $T_1, \dots, T_{m-1}$ are dummy variables for the years (relative to the first year), and $(XT)_1, \dots, (XT)_{(n-1)(m-1)}$ are their pairwise interactions. The data actually has more than two dimensions, so I am not estimating a single fixed effect for each observation, but I ignore this here for readability.
The sum of the coefficient estimates $\beta_2 + \gamma_3 + \delta_6$ (for instance) tells me how unit 2 in year 3 differs from baseline unit 0 in baseline year 0, and so forth.
What I would like to know is: How can I calculate the combined confidence intervals for the sum of these three (and other triples of) coefficients?
There are two related posts here and here, which address how to combine confidence intervals for two coefficient estimates from an interaction. Based on these posts, I have tried to adapt the procedure to the case with three coefficient estimates, but I am uncertain whether my approach is correct.
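If I understand the logic of those posts correctly, the quantity I need is the variance of a sum of coefficient estimates, which picks up all pairwise covariances:

$$\operatorname{Var}(\hat\beta_2 + \hat\gamma_3 + \hat\delta_6) = \operatorname{Var}(\hat\beta_2) + \operatorname{Var}(\hat\gamma_3) + \operatorname{Var}(\hat\delta_6) + 2\operatorname{Cov}(\hat\beta_2, \hat\gamma_3) + 2\operatorname{Cov}(\hat\beta_2, \hat\delta_6) + 2\operatorname{Cov}(\hat\gamma_3, \hat\delta_6).$$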
Here is what I did, illustrated on some toy data:
> set.seed(1)
> unit <- sample(0:2, 100, replace = T)
> yr <- sample(0:3, 100, replace = T)
>
> # for simplicity, create Y based on a linear DGP
> Y <- 1 + .5*unit + 1.5*yr - .9*unit*yr + rnorm(100, sd = 4)
>
> summary(reg <- lm(Y ~ factor(unit) * factor(yr)))
Call:
lm(formula = Y ~ factor(unit) * factor(yr))
Residuals:
Min 1Q Median 3Q Max
-6.093 -2.373 -0.201 2.433 8.717
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.374 1.135 1.21 0.2291
factor(unit)1 -2.694 1.702 -1.58 0.1171
factor(unit)2 3.787 2.123 1.78 0.0779 .
factor(yr)1 -0.429 1.768 -0.24 0.8088
factor(yr)2 0.534 1.605 0.33 0.7399
factor(yr)3 4.789 1.853 2.58 0.0114 *
factor(unit)1:factor(yr)1 4.714 2.454 1.92 0.0580 .
factor(unit)2:factor(yr)1 -1.394 2.789 -0.50 0.6183
factor(unit)1:factor(yr)2 3.314 2.293 1.45 0.1519
factor(unit)2:factor(yr)2 -6.800 2.763 -2.46 0.0158 *
factor(unit)1:factor(yr)3 3.121 2.623 1.19 0.2374
factor(unit)2:factor(yr)3 -8.864 2.818 -3.15 0.0023 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.59 on 88 degrees of freedom
Multiple R-squared: 0.302, Adjusted R-squared: 0.214
F-statistic: 3.45 on 11 and 88 DF, p-value: 0.0005
Thus, for unit 2 in year 3 I get:
> coef(reg)[c(3,6,12)]
factor(unit)2 factor(yr)3 factor(unit)2:factor(yr)3
3.78706 4.78857 -8.86370
So, $\beta_2 + \gamma_3 + \delta_6$ equals:
> sum(coef(reg)[c(3,6,12)])
[1] -0.288069
My attempt to calculate combined confidence intervals for the three coefficients looks like this:
# get the residual degrees of freedom: n minus the number of estimated
# coefficients (88 here, matching the summary() output above)
DoF <- df.residual(reg)
ca_1.96 <- qt(0.975, DoF)
# get the relevant part of the variance-covariance matrix
sigma2mat <- vcov(reg)[c(3,6,12), c(3,6,12)]
sigma2mat
factor(unit)2 factor(yr)3 factor(unit)2:factor(yr)3
factor(unit)2 4.50659 1.28760 -4.50659
factor(yr)3 1.28760 3.43359 -3.43359
factor(unit)2:factor(yr)3 -4.50659 -3.43359 7.94018
# the standard error of the sum: square root of the sum of all
# entries (variances plus twice each covariance) of this block
SE <- sqrt(sum(sigma2mat))
SE
[1] 1.60474
# calculate confidence intervals
CIs <- sum(coef(reg)[c(3,6,12)]) + c(-ca_1.96, ca_1.96) * SE
CIs
[1] -3.47715 2.90102
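As a sanity check on the summing, the same interval can be written with an indicator vector $a$ that picks out the three coefficients, since $\operatorname{Var}(a'\hat\beta) = a' V a$ with $V$ the full variance-covariance matrix. A minimal sketch (the toy model is refit here so the snippet stands alone):

```r
# same CI via an indicator vector a: Var(a' beta-hat) = a' V a
set.seed(1)
unit <- sample(0:2, 100, replace = TRUE)
yr <- sample(0:3, 100, replace = TRUE)
Y <- 1 + .5*unit + 1.5*yr - .9*unit*yr + rnorm(100, sd = 4)
reg <- lm(Y ~ factor(unit) * factor(yr))

a <- rep(0, length(coef(reg)))
a[c(3, 6, 12)] <- 1            # beta_2, gamma_3, and the unit 2 x year 3 delta
est <- sum(a * coef(reg))      # same as sum(coef(reg)[c(3, 6, 12)])
se  <- sqrt(drop(t(a) %*% vcov(reg) %*% a))
est + c(-1, 1) * qt(0.975, df.residual(reg)) * se
```

This matrix form generalizes directly to any other triple (or longer combination) of coefficients by changing which entries of `a` are set to 1.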
Edit:
Adding the intercept (and its row and column of the variance-covariance matrix, so the standard error is recomputed as well):
sum(coef(reg)[c(1,3,6,12)])
[1] 1.08606
sigma2mat <- vcov(reg)[c(1,3,6,12), c(1,3,6,12)]
SE <- sqrt(sum(sigma2mat))
CIs <- sum(coef(reg)[c(1,3,6,12)]) + c(-ca_1.96, ca_1.96) * SE
CIs
[1] -1.16897 3.34109
I would be grateful for any pointers as to whether this is the right way of going about this and, if not, how it could be done. Thank you!
Best Answer
You've left out the intercept and its part of the variance-covariance matrix from your predictions. Otherwise this looks fine, but you've probably worked harder than is necessary: the predict() function can do the hard part for you.
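The exact call the answer used is not shown; a sketch of what that predict()-based approach looks like, reusing the question's toy data (refit here so the snippet is self-contained):

```r
# confidence intervals for every unit-by-year cell mean in one call;
# predict() accounts for the intercept and the full covariance matrix
set.seed(1)
unit <- sample(0:2, 100, replace = TRUE)
yr <- sample(0:3, 100, replace = TRUE)
Y <- 1 + .5*unit + 1.5*yr - .9*unit*yr + rnorm(100, sd = 4)
reg <- lm(Y ~ factor(unit) * factor(yr))

nd <- expand.grid(unit = 0:2, yr = 0:3)   # one row per unit-year cell
pred <- predict(reg, newdata = nd, interval = "confidence", level = 0.95)
cbind(nd, pred)                            # columns fit, lwr, upr per cell
```

The row for unit 2 in year 3 reproduces $\alpha + \beta_2 + \gamma_3 + \delta_6$, with an interval that correctly incorporates the intercept's variance and its covariances with the other coefficients.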