Solved – Dealing with multicollinearity of explanatory variables in panel regression when the usual remedies fail

multicollinearity, panel data, regression

I am regressing firm characteristics on some stock trading-related measures in a panel dataset. Firm size is a highly significant control variable, regardless of the estimation method. My focus variables are related to firm size, though, either by construction (e.g. $\text{focus variable} = x / \text{firm size}$) or because of an economic relationship.

As a consequence, I find myself in a classic multicollinearity situation: if firm size is included as a control variable, my focus variables become insignificant. If firm size is left out, the focus variables are highly significant.

None of the usual advice (e.g. http://en.wikipedia.org/wiki/Multicollinearity) helps: I cannot obtain more data, and I cannot run my regression on principal components because I need interpretable coefficients.

I have little experience with this kind of problem, but with some imagination I came up with the following two ideas:

  1. Running the regression with firm size as a control variable and additionally including interaction terms between each focus variable and firm size.

  2. Trying to strip away the firm size effect from both the dependent variable and the focus variables, e.g. by first regressing the dependent/focus variable on firm size and then using the residuals as the dependent/focus variable in the actual regression.

Would either or both of these ideas make sense? Any comments or alternative ideas would be very welcome!

Best Answer

How is firm size measured: by market capitalization? If so, could you index market cap and simply rank the firms in terms of size? The regressor would then be "size_rank". Or you could bucket firms into small, medium, and large groups. This might mitigate some of the multicollinearity.
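As a rough illustration of the ranking and bucketing idea, here is a minimal sketch in plain Python. The market cap figures, the function names `size_rank` and `size_bucket`, and the choice of three equal-count groups are all assumptions made for the example. In a panel you would presumably want to rank within each period rather than across the pooled sample.

```python
def size_rank(market_caps):
    """Rank firms by market cap, 1 = smallest. Ties keep their input order."""
    order = sorted(range(len(market_caps)), key=lambda i: market_caps[i])
    ranks = [0] * len(market_caps)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def size_bucket(ranks, labels=("small", "medium", "large")):
    """Split ranked firms into len(labels) groups of roughly equal count."""
    n, k = len(ranks), len(labels)
    return [labels[min((r - 1) * k // n, k - 1)] for r in ranks]


# Hypothetical market caps for six firms in one period (units are arbitrary).
market_caps = [120.0, 15.0, 980.0, 45.0, 300.0, 60.0]

ranks = size_rank(market_caps)
buckets = size_bucket(ranks)

for cap, rank, bucket in zip(market_caps, ranks, buckets):
    print(f"market cap {cap:7.1f} -> size_rank {rank}, bucket {bucket}")
```

The buckets would then enter the regression as dummy variables (with one group omitted as the baseline), while "size_rank" would enter as a single regressor.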

The broader issue is that because most of your focus variables include firm size in the denominator, firm size is already implicitly controlled for in your model. In other words, you've already 'normalized' the regressors by firm size. If you think hard about why you need to control for additional variation due to firm size, you might come up with a solution, or decide to drop the firm size control altogether.

For example, if you think that the coefficient on "focus_variable_1" should differ with the size of the firm, you could add an interaction term (firm_size*focus_variable_1). This is along the lines of your suggestion (1) above; however, you would want to keep the existing non-interaction term and not also control for firm size separately. Then, to calculate the full impact of focus_variable_1 on the dependent variable, you would add the coefficient on focus_variable_1 to the coefficient on the interaction term multiplied by the mean firm size, and perhaps repeat the calculation at +/- one standard deviation of firm size. As you can see, interpretation gets difficult quickly, so it is good to have the theory solid before blindly dropping in additional interaction terms.
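To make that calculation concrete: with a model of the form $y = b_0 + b_1 \cdot \text{focus} + b_3 \cdot (\text{focus} \times \text{size}) + e$, the effect of the focus variable at a given firm size is $b_1 + b_3 \cdot \text{size}$. The sketch below evaluates this at the mean firm size and at one standard deviation either side. The coefficient values and firm sizes are made up for illustration; in practice they would come from your fitted regression and your data.

```python
from statistics import mean, stdev


def marginal_effect(b_focus, b_interaction, firm_size):
    """Effect of a one-unit change in the focus variable at a given firm size."""
    return b_focus + b_interaction * firm_size


# Hypothetical estimates, NOT taken from any real regression.
b_focus = 0.80         # coefficient on focus_variable_1
b_interaction = -0.05  # coefficient on firm_size*focus_variable_1

# Hypothetical firm sizes (e.g. log market cap).
firm_sizes = [2.0, 4.0, 6.0, 8.0, 10.0]

size_mean = mean(firm_sizes)
size_sd = stdev(firm_sizes)  # sample standard deviation

effects = {
    label: marginal_effect(b_focus, b_interaction, size)
    for label, size in [
        ("mean - 1 sd", size_mean - size_sd),
        ("mean", size_mean),
        ("mean + 1 sd", size_mean + size_sd),
    ]
}

for label, effect in effects.items():
    print(f"effect of focus_variable_1 at {label}: {effect:.3f}")
```

With a negative interaction coefficient, as assumed here, the effect of the focus variable shrinks as firms get larger, which is why a single "coefficient on focus_variable_1" no longer summarizes the relationship.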

For additional discussion on interpreting continuous*continuous interaction terms, see: http://www.nd.edu/~rwilliam/stats2/l55.pdf
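On suggestion (2) from the question, one thing worth checking before investing in it: by the Frisch-Waugh-Lovell theorem, if you residualize both the dependent variable and a focus variable on firm size and then regress residual on residual, the OLS coefficient is identical to the one you get by simply including firm size as a control. So, at least in a plain OLS setting, the two-step approach should not change the point estimate. The sketch below checks this numerically on made-up data with a single focus variable; the helper names are hypothetical, and a panel estimator with fixed effects would need the same logic applied after the within transformation.

```python
from statistics import mean


def centered(v):
    """Subtract the mean from every element."""
    v_bar = mean(v)
    return [vi - v_bar for vi in v]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def residualize(v, size):
    """Residuals from an OLS regression of v on size (with intercept)."""
    vc, sc = centered(v), centered(size)
    slope = dot(sc, vc) / dot(sc, sc)
    return [vi - slope * si for vi, si in zip(vc, sc)]


def coef_with_size_control(y, x, size):
    """Coefficient on x in the OLS regression y ~ 1 + x + size."""
    yc, xc, sc = centered(y), centered(x), centered(size)
    num = dot(xc, yc) * dot(sc, sc) - dot(xc, sc) * dot(sc, yc)
    den = dot(xc, xc) * dot(sc, sc) - dot(xc, sc) ** 2
    return num / den


# Made-up data: x is strongly related to size, as in the question.
size = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [2.1, 1.9, 3.2, 3.8, 5.1, 5.7]
y = [1.0, 2.3, 2.9, 4.8, 5.2, 6.9]

# Two-step approach: strip size out of y and x, then regress residuals.
y_res = residualize(y, size)
x_res = residualize(x, size)
b_two_step = dot(x_res, y_res) / dot(x_res, x_res)

# One-step approach: include size as a control.
b_control = coef_with_size_control(y, x, size)

print(f"two-step coefficient:   {b_two_step:.6f}")
print(f"size-as-control coeff.: {b_control:.6f}")
```

The standard errors from a naive second-stage regression can differ slightly because of the degrees-of-freedom count, but the multicollinearity itself is not removed by the two-step procedure.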