In my study I have an independent continuous variable x1 (momentum) and four dummy variables D1 D2 D3 D4 which indicate industry type. I am investigating the four interaction variables between the dummy variable and momentum.
Question: do I have to test the interaction effects separately (i.e., one single model for each interaction, that is, one regression for x1D1, a different regression for x1D2…)?
Or do I need to test these four interaction effects in one single regression (i.e., in one single regression I include: x1D1 x1D2 x1D3…)? What is the difference in terms of interpretation?
Note: I have been investigating this for a while now and have come across two arguments:
In favor of a single regression:
Include all the terms together so that you obtain the best possible estimates of your interaction terms. Fitting the models separately would mean failing to control for the other covariates.
In favor of multiple regressions:
Including all the dummies and interactions can create a multicollinearity problem; it is therefore suggested to include one dummy and its interaction at a time.
Help would be appreciated, thank you in advance.
Spike Gontscharoff
Best Answer
Assuming the four dummies are not mutually exclusive categories$\dots$
You didn't give us an outcome variable, but let's assume it is continuous and call it $y$. You could then specify the model: $$ y = b_0 + b_1 d_1 + b_2 d_2 + b_3 d_3 + b_4 d_4 + b_5 x_1 d_1 + b_6 x_1 d_2 + b_7 x_1 d_3 + b_8 x_1 d_4 + \epsilon $$ In R, this is much shorter to write:
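A minimal sketch, assuming `d1` through `d4` are 0/1 columns and `y` and `x1` are numeric columns of a data frame `dat` (the simulated data here are placeholders for your own):

```r
# Toy data so the snippet runs on its own; replace with your own data frame
set.seed(1)
dat <- data.frame(y  = rnorm(100),
                  x1 = rnorm(100),
                  d1 = rbinom(100, 1, 0.5),
                  d2 = rbinom(100, 1, 0.5),
                  d3 = rbinom(100, 1, 0.5),
                  d4 = rbinom(100, 1, 0.5))

# '(d1 + d2 + d3 + d4) * x1' expands to all main effects plus interactions;
# '- x1' then drops the lone x1 term, matching the equation above
fit <- lm(y ~ (d1 + d2 + d3 + d4) * x1 - x1, data = dat)
coef(fit)  # b0, b1..b4 (dummies), b5..b8 (interactions)
```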
The marginal effect for a single dummy, let's say $d_1$, would then be: $$ \frac{\partial y}{\partial d_1} = b_1 + b_5x_1 $$
Unless you have (a) a high degree of multicollinearity, (b) a data set so small that eight terms leave you too few degrees of freedom, or (c) a theoretical reason why you should also interact all your dummies with one another, something like the above should do the trick.
If, in the single model, you interact everything with everything as written in your question, that is a lot of terms, and your model will most likely be over-specified (unless you are working with a very large number of observations).
The argument against multiple different models would be that if two of your dummies are correlated with each other and the outcome variable, then leaving one out will produce omitted variable bias. But as ever, the 'best' model will depend on the best theoretical justification for the model and not just the best fit statistics.
If the dummy variables are mutually exclusive$\dots$
Then, both approaches are doing the same thing. You can estimate separate models for each one, or you can drop one of your dummy variables, in which case the dropped dummy variable will be your reference category. Both approaches should give you the same estimates, as noted by @probabilityislogic in the comments.
Update: Example
So, why will we get the same estimate, whether we interact or subset? First, let's generate some random data for our outcome $y$, assign each of our 20k observations to one of four mutually exclusive categories (e.g., industries), and generate another continuous covariate $x$.
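A sketch of such a simulation (the seed and variable names are my own choices; the outcome is pure noise, so all true effects are zero):

```r
# 20,000 observations: noise outcome y, continuous covariate x,
# and a factor d assigning each row to one of four industries
set.seed(123)
n   <- 20000
dat <- data.frame(y = rnorm(n),
                  x = rnorm(n),
                  d = factor(sample(1:4, n, replace = TRUE)))
```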
Specify first model with interaction:
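For instance (regenerating the simulated data so the snippet runs on its own; with a different seed, your estimates will differ from the particular numbers in the worked example):

```r
set.seed(123)
n   <- 20000
dat <- data.frame(y = rnorm(n),
                  x = rnorm(n),
                  d = factor(sample(1:4, n, replace = TRUE)))

# Full interaction: a separate intercept and a separate slope on x
# for each industry, with d = 1 as the reference category
m1 <- lm(y ~ x * d, data = dat)
summary(m1)
```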
Now, calculate $\frac{\partial y}{\partial x}$ for when $d2 = 1$ and $d1 = d3 = d4 = 0$.
$$ -0.009853 + 0.010967 \times 1 + 0.016275 \times 0 - 0.007537 \times 0 = 0.001114 $$
Now, let's subset the data to include only observations where $d2 = 1$, and regress $y$ on $x$.
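A sketch (again regenerating the simulated data so the snippet is self-contained; the coefficient name `x:d2` assumes the factor coding above):

```r
set.seed(123)
n   <- 20000
dat <- data.frame(y = rnorm(n),
                  x = rnorm(n),
                  d = factor(sample(1:4, n, replace = TRUE)))

m1 <- lm(y ~ x * d, data = dat)               # interaction model
m2 <- lm(y ~ x, data = subset(dat, d == "2")) # group-2 observations only

# Slope on x for group 2 implied by the interaction model...
coef(m1)["x"] + coef(m1)["x:d2"]
# ...is identical to the slope from the subset regression
coef(m2)["x"]
```

The equality is exact because a fully interacted model is algebraically equivalent to fitting a separate regression within each category.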
We get exactly the same coefficient on $x$, 0.001114. This is the effect of $x$ on $y$ when $d2 = 1$.