Solved – >1 interaction variable, single regression versus multiple regressions

categorical data, interaction, multiple regression, regression

In my study I have an independent continuous variable x1 (momentum) and four dummy variables D1, D2, D3, D4 which indicate industry type. I am investigating the four interaction variables between the dummy variables and momentum.

Question: do I have to test the interaction effects separately (i.e., one single model for each interaction, that is, one regression for x1D1, a different regression for x1D2…)?

Or do I need to test these interaction effects in one single regression (i.e., in one single regression I include x1D1, x1D2, x1D3, …)? What is the difference in terms of interpretation?

Note: I have been investigating this for a while now and have come across two arguments:

In favor of a single regression:
Include all the terms together so that you obtain the best possible estimates of your interaction terms. Fitting the models separately would mean failing to control for your other covariates.

In favor of multiple regressions:
Including all the dummies and interactions can create a multicollinearity problem. Therefore, it is suggested to include one dummy and its interaction at a time.

Help would be appreciated, thank you in advance.

Spike Gontscharoff

Best Answer

Assuming the four dummies are not mutually exclusive categories$\dots$

You didn't give us an outcome variable, but let's assume it is continuous and call it $y$. You could then specify the model: $$ y = b_0 + b_1 d_1 + b_2 d_2 + b_3 d_3 + b_4 d_4 + b_5 x_1 d_1 + b_6 x_1 d_2 + b_7 x_1 d_3 + b_8 x_1 d_4 + \epsilon $$ In R, this is much shorter to write (note that the formula below also adds a main effect for $x_1$):

mod <- lm(y ~ (d1 + d2 + d3 + d4) * x1)

The marginal effect for a single dummy, let's say $d_1$, would then be: $$ \frac{\partial y}{\partial d_1} = b_1 + b_5x_1 $$
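
If it helps, that marginal effect can be read straight off the fitted object. A minimal sketch, assuming d1–d4 are numeric 0/1 dummies so that the relevant coefficients in mod above are named "d1" and "d1:x1", evaluated here at the sample mean of x1:

me_d1 <- function(model, at_x1) {
  # marginal effect of d1 at a given value of x1: b1 + b5 * x1
  b <- coef(model)
  unname(b["d1"] + b["d1:x1"] * at_x1)
}
me_d1(mod, at_x1 = mean(x1))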

Unless you have (a) a high degree of multicollinearity, (b) a data set so small that with eight terms you have too few degrees of freedom, or (c) a theoretical reason why you should also interact all your dummies with one another, something like the above should do the trick.

If, in the single model as written in your question, you interact everything with everything, that is a lot of terms, and your model will most likely be over-specified (unless you're working with a very large number of observations).

The argument against multiple different models would be that if two of your dummies are correlated with each other and the outcome variable, then leaving one out will produce omitted variable bias. But as ever, the 'best' model will depend on the best theoretical justification for the model and not just the best fit statistics.
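
To see the omitted-variable point in action, here is a small simulated sketch (not from the original answer; all names and numbers are invented): two correlated dummies both affect the outcome, and dropping one of them biases the coefficient on the other.

set.seed(1)
n   <- 10000
d_a <- rbinom(n, 1, 0.5)                 # first dummy
d_b <- rbinom(n, 1, 0.2 + 0.6 * d_a)     # second dummy, correlated with d_a
y2  <- 1 + 2 * d_a + 3 * d_b + rnorm(n)  # both dummies affect the outcome
coef(lm(y2 ~ d_a + d_b))["d_a"]  # close to the true effect of 2
coef(lm(y2 ~ d_a))["d_a"]        # biased upward because d_b is omitted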

If the dummy variables are mutually exclusive$\dots$

Then, both approaches are doing the same thing. You can estimate separate models for each one, or you can drop one of your dummy variables, in which case the dropped dummy variable will be your reference category. Both approaches should give you the same estimates, as noted by @probabilityislogic in the comments.

Update: Example

So, why will we get the same estimate, whether we interact or subset? First, let's generate some random data for our outcome $y$, assign each of our 20k observations to one of four mutually exclusive categories (e.g., industries), and generate another continuous covariate $x$.

set.seed(405)
# assign each of the 20,000 observations to one of four categories (industries)
d <- sample(1:4, 20000, replace = TRUE)
# continuous covariate and continuous outcome, generated independently
x <- rnorm(20000, mean = 5, sd = 4)
y <- rnorm(20000, mean = 10, sd = 3)
data <- data.frame(y, d, x)
data$d <- factor(d)

Specify first model with interaction:

mod1 <- lm(y ~ x * d, data = data)
summary(mod1)
Call:
lm(formula = y ~ x * d, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.9035  -2.0169   0.0208   1.9962  13.0574 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.069148   0.068189 147.666   <2e-16 ***
x           -0.009853   0.010702  -0.921    0.357    
d2          -0.061505   0.096142  -0.640    0.522    
d3          -0.091567   0.096191  -0.952    0.341    
d4           0.078902   0.095918   0.823    0.411    
x:d2         0.010967   0.015106   0.726    0.468    
x:d3         0.016275   0.015162   1.073    0.283    
x:d4        -0.007537   0.015029  -0.501    0.616    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.997 on 19992 degrees of freedom
Multiple R-squared:  0.0002449, Adjusted R-squared:  -0.0001052 
F-statistic: 0.6995 on 7 and 19992 DF,  p-value: 0.6726

Now, calculate $\frac{\partial y}{\partial x}$ for when $d2 = 1$ and $d1 = d3 = d4 = 0$.

$$ -0.009853 + 0.010967*1 + 0.016275*0 + -0.007537*0 = 0.001114 $$
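
The same number can be read off the fitted object directly (reusing mod1 from above and the coefficient names shown in the summary):

# slope of x when d = 2: main effect of x plus the x:d2 interaction
coef(mod1)["x"] + coef(mod1)["x:d2"]  # approximately 0.001114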

Now, let's subset the data to include only observations where $d2 = 1$, and regress $y$ on $x$.

mod2 <- lm(y ~ x, data = subset(data, d == "2"))
summary(mod2)

Call:
lm(formula = y ~ x, data = subset(data, d == "2"))

Residuals:
     Min       1Q   Median       3Q      Max 
-10.5873  -2.0639   0.0417   2.0342  11.4410 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.007643   0.068723 145.623   <2e-16 ***
x            0.001114   0.010810   0.103    0.918    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.039 on 4982 degrees of freedom
Multiple R-squared:  2.13e-06,  Adjusted R-squared:  -0.0001986 
F-statistic: 0.01061 on 1 and 4982 DF,  p-value: 0.918

We get the exact same coefficient on $x$ of 0.001114. This is the effect of $x$ on $y$ when $d2=1$.
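
As a quick programmatic check (reusing mod1 and mod2 from above, not part of the original answer), the interaction-model sum and the subset-model slope agree:

# point estimates match; standard errors can differ, since the subset model
# estimates its own residual variance rather than pooling across groups
all.equal(unname(coef(mod1)["x"] + coef(mod1)["x:d2"]),
          unname(coef(mod2)["x"]))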
