Panel Data Regression – Correct Way to Deal with Multiple Fixed Effects

fixed-effects-model, panel data, regression

I don't have much experience with panel data so I apologize in advance if this sounds ridiculous.

Let's say that I am trying to control for individual and temporal fixed effects when running a panel data regression and I have 998 individuals and 29 years of data. In Stata the way to deal with multi-variate fixed effects is to create dummy variables that uniquely identify each combination. In my case this would be (29×998) – 1 = 28,941 dummy variables.

Will this result in very high multicollinearity? What if I had ~74,000 individuals?

My gut tells me that this is ridiculous, but gut feel and statistics don't really go well together.

Best Answer

When you use time dummies, you don't need a time dummy for every individual separately but for every year. So this leaves you with 28 time dummies and 997 individual dummies (always omitting the first year and first individual to avoid the dummy variable trap).
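To make the count concrete, one way to write the model with both sets of fixed effects is the following (the symbols $\alpha_i$, $\gamma_t$, $\beta$ and $\varepsilon_{it}$ are notation assumed here for illustration, not taken from the question):

$$y_{it} = \alpha_i + \gamma_t + x_{it}\beta + \varepsilon_{it}, \qquad i = 1,\dots,998, \quad t = 1,\dots,29$$

With a constant in the regression, this requires $(998 - 1) + (29 - 1) = 997 + 28 = 1{,}025$ dummies. One dummy per individual-year combination would instead need $28{,}941$ of them, which in a balanced panel of $998 \times 29 = 28{,}942$ observations would leave no degrees of freedom for anything else.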

The solution to your problem is much simpler than what the other answer suggested here. If you read any introductory text on panel data (you can start with these lecture notes), you should acquaint yourself with the fixed effects estimator which is sometimes referred to as the within estimator as well.

The procedure is as follows:

  1. average each variable over time for each individual, e.g. $\overline{y}_i = \frac{1}{T}\sum_{t=1}^{29}y_{it}$ and $\overline{x}_i = \frac{1}{T}\sum_{t=1}^{29}x_{it}$
  2. subtract this individual mean from each observation, $\tilde{y}_{it} = y_{it} - \overline{y}_{i}$ and $\tilde{x}_{it} = x_{it} - \overline{x}_i$
  3. regress $\tilde{y}_{it}$ on $\tilde{x}_{it}$ and your year dummies, and cluster the standard errors on the individual's ID to account for serial correlation
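Steps 1 and 2 above can be sketched in a few lines of plain Python. This is only an illustration on made-up numbers: the function name `within_transform` and the toy panel are assumptions, and in practice your statistical software does this for you.

```python
from collections import defaultdict


def within_transform(ids, values):
    """Subtract each individual's time average from its observations."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for i, v in zip(ids, values):
        totals[i] += v
        counts[i] += 1
    # Step 1: individual-specific means over time.
    means = {i: totals[i] / counts[i] for i in totals}
    # Step 2: deviation of every observation from its individual's mean.
    return [v - means[i] for i, v in zip(ids, values)]


# Toy panel (hypothetical data): 2 individuals observed for 3 years each.
ids = [1, 1, 1, 2, 2, 2]
y = [1.0, 2.0, 3.0, 10.0, 12.0, 14.0]

y_tilde = within_transform(ids, y)
print(y_tilde)  # [-1.0, 0.0, 1.0, -2.0, 0.0, 2.0]
```

The level difference between the two individuals (means of 2 and 12) disappears after the transformation, which is exactly what an individual dummy would have absorbed.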

Although it is not obvious, Mundlak (1978) has shown that this procedure is equivalent to including a dummy for every individual but one (again omitting the first individual), as in the dummy variable approach. The advantage is clear: with this three-step procedure, known as the "within transformation", you don't need all those individual dummies.
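If you want to convince yourself of this equivalence numerically, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the balanced toy panel (4 individuals, 3 years), the made-up effects and noise, and the small helper functions `solve`, `ols` and `demean`. It estimates the slope once with explicit individual dummies and once with the within transformation, keeping the year dummies in both regressions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """OLS coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[p] * r[q] for r in X) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(X, y)) for p in range(k)]
    return solve(xtx, xty)


def demean(ids, values):
    """Within transformation: subtract each individual's own mean."""
    means = {i: sum(v for j, v in zip(ids, values) if j == i) / ids.count(i)
             for i in set(ids)}
    return [v - means[i] for i, v in zip(ids, values)]


# Made-up balanced panel: 4 individuals, 3 years, true slope of 2.
N, T = 4, 3
ids = [i for i in range(N) for _ in range(T)]
years = [t for _ in range(N) for t in range(T)]
x = [1.0, 2.5, 2.0, 0.5, 1.0, 3.5, 2.0, 2.0, 4.5, 3.0, 1.5, 1.0]
alpha = [5.0, -1.0, 3.0, 0.5]  # individual effects
gamma = [0.0, 0.7, -0.4]       # year effects
noise = [0.1, -0.2, 0.05, 0.0, 0.15, -0.1, -0.05, 0.2, -0.15, 0.1, -0.1, 0.0]
y = [alpha[i] + gamma[t] + 2.0 * xv + e
     for i, t, xv, e in zip(ids, years, x, noise)]

# Year dummies, omitting the first year.
year_dummies = [[float(t == s) for s in range(1, T)] for t in years]

# (a) Dummy approach: x, one dummy per individual (so no constant), year dummies.
X_lsdv = [[xv] + [float(i == j) for j in range(N)] + yd
          for xv, i, yd in zip(x, ids, year_dummies)]
beta_lsdv = ols(X_lsdv, y)[0]

# (b) Within approach: demeaned y on demeaned x, a constant, and year dummies.
x_tilde, y_tilde = demean(ids, x), demean(ids, y)
X_within = [[xv, 1.0] + yd for xv, yd in zip(x_tilde, year_dummies)]
beta_within = ols(X_within, y_tilde)[0]

print(beta_lsdv, beta_within)  # the two slopes agree up to rounding error
```

Using one dummy per individual without a constant spans the same columns as a constant plus all but one individual dummy, so the slope is unaffected by that choice. Note that the sketch only compares point estimates; the standard errors from a hand-rolled regression on demeaned data would still need a degrees-of-freedom correction, which canned panel routines handle for you.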

Most statistical software has ready-made routines for this type of estimation, as it is fairly standard. In Stata you would simply declare your data to be a panel data set, which allows you to use the corresponding panel data regression and data analysis commands. For example:

webuse nlswork
xtset idcode year
xtreg ln_wage age union i.year, fe vce(cluster idcode)

Here i.year automatically adds the year dummies to the regression. With 29 years you lose 28 degrees of freedom, which isn't awful. Good introductions to the topic are:

  • Wooldridge, J. (2008) "Introductory Econometrics", 4th Edition, South Western College
  • Baltagi, B.H. (2013) "Econometric Analysis of Panel Data", 5th Edition, John Wiley & Sons
  • Wooldridge, J. (2010) "Econometric Analysis of Cross Section and Panel Data", 2nd Edition, MIT Press

The last reference is for advanced students.
