Logistic Regression – Is Mundlak Fixed Effects Procedure Applicable for Logistic Regression with Dummies?

categorical data, fixed-effects-model, logistic, stata

I have a dataset with 8000 clusters and 4 million observations. Unfortunately my statistical software, Stata, runs rather slowly when using its panel data function for logistic regression: xtlogit, even with a 10% subsample.

However, when using the nonpanel logit function results appear much sooner. Therefore I may be able to benefit from using logit on modified data that accounts for fixed effects.

I believe this procedure is called the "Mundlak fixed effects procedure" (Mundlak, Y. 1978. Pooling of Time-Series and Cross-Section Data. Econometrica, 46(1), 69-85.)

I found an intuitive explanation of this procedure in a paper by Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A review and recommendations. The Leadership Quarterly, 21(6). 1086-1120. I quote:

One way to get around the problem of omitted fixed effects and to still
include Level 2 variables is to include the cluster means of all Level
1 covariates in the estimated model (Mundlak, 1978). The cluster means
can be included as regressors or subtracted (i.e., cluster-mean
centering) from the Level 1 covariate. The cluster means are invariant
within cluster (and vary between clusters) and allow for consistent
estimation of Level 1 parameters just as if fixed-effects had been
included (see Rabe-Hesketh & Skrondal, 2008).

Therefore cluster-mean centering seems ideal and practical for solving my computational problem. However, these papers seem to be geared towards linear regression (OLS).

Is this method of cluster-mean centering also applicable for "replicating" fixed effects binary logistic regression?

A more technical question that should result in the same answer would be: is xtlogit depvar indepvars, fe with dataset A equal to logit depvar indepvars with dataset B when dataset B is the cluster-mean centered version of dataset A?

An added difficulty I found in this cluster-mean centering is how to cope with dummies. Because dummies are either 0 or 1, are they identical in random and fixed effects regression? Should they not be "centered"?

Best Answer

First differencing and within transformations such as demeaning are not available in models like logit, because in nonlinear models these tricks do not remove the unobserved fixed effects. Even if you had a smaller data set in which it was feasible to include N-1 individual dummies to estimate the fixed effects directly, the estimates would be biased unless the time dimension of your data is large (the incidental parameters problem). The fixed effects in panel logit are therefore eliminated neither by differencing nor by demeaning; elimination is only possible because of the logit functional form, by conditioning on the number of positive outcomes within each cluster. If you are interested in the details, have a look at these notes by Söderbom: PDF page 30 explains why demeaning or first differencing does not help in logit/probit, and page 42 introduces the panel logit estimator.
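To illustrate why the logit functional form matters, here is a sketch of the standard two-period case. The notation is assumed for illustration and does not come from the question: $y_{it}$ is the binary outcome, $x_{it}$ the covariates, $\alpha_i$ the cluster fixed effect, and outcomes are taken to be independent across $t$ given $x_{it}$ and $\alpha_i$.

```latex
% Assumed notation: y_{it} binary outcome, x_{it} covariates, \alpha_i cluster effect
\begin{align*}
\Pr(y_{it}=1 \mid x_{it},\alpha_i)
  &= \frac{\exp(\alpha_i + x_{it}'\beta)}{1+\exp(\alpha_i + x_{it}'\beta)},
  \qquad t=1,2,\\[4pt]
\Pr(y_{i1}=0,\,y_{i2}=1 \mid y_{i1}+y_{i2}=1)
  &= \frac{\Pr(0,1)}{\Pr(0,1)+\Pr(1,0)} \\
  &= \frac{\exp(\alpha_i + x_{i2}'\beta)}
          {\exp(\alpha_i + x_{i1}'\beta)+\exp(\alpha_i + x_{i2}'\beta)} \\
  &= \frac{\exp\{(x_{i2}-x_{i1})'\beta\}}{1+\exp\{(x_{i2}-x_{i1})'\beta\}}.
\end{align*}
```

The factor $\exp(\alpha_i)$ cancels in the last step, so the conditional likelihood depends on $\beta$ only. Clusters in which the outcome never changes contribute nothing to this likelihood. Running a plain logit on demeaned covariates does not reproduce this cancellation.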

Another problem is that xtlogit, and panel logit models in general, do not estimate the fixed effects themselves, which are needed to calculate marginal effects. Without them it will be very awkward to interpret your coefficients, which might be disappointing after having run the model for hours and hours.

With such a large data set and the conceptual difficulties of FE panel logit mentioned above, I would stick with the linear probability model (LPM). I hope this answer does not disappoint you, but there are many good reasons for this advice: the LPM is much faster; the coefficients can be interpreted straight away (this holds in particular if you have interaction effects in your model, because the interpretation of their coefficients changes in nonlinear models); the fixed effects are easily controlled for; and you can adjust the standard errors for autocorrelation and clustering without estimation times increasing beyond reason. I hope this helps.
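As a minimal sketch of what this could look like in Stata: the variable names below (depvar, x1, x2, clusterid) are placeholders and not taken from the question, and the choice of cluster-robust standard errors at the clusterid level is an assumption about the data.

```stata
* Hypothetical names: depvar is the 0/1 outcome, x1 x2 are covariates,
* clusterid identifies the roughly 8000 clusters.
xtset clusterid

* Linear probability model with cluster fixed effects (within estimator);
* standard errors are clustered on clusterid.
xtreg depvar x1 x2, fe vce(cluster clusterid)
```

The coefficients can be read directly as changes in the probability of depvar = 1. Fitted values of an LPM are not guaranteed to lie between 0 and 1.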

Related Solutions

Solved – Logistic regression: fixed effects for firms, countries & years

If each firm belongs to a single country, then with firm fixed effects you don't need country dummies as well. In fact, you can't estimate both: the country effect (unless interacted with a time dummy) is time invariant, so it is collinear with the firm fixed effect. That is not a problem, because the country effect is already captured by the firm fixed effect.

Should you include year fixed effects? Depends on your data and research question, but if you want to control for year effects that affect all firms in all countries, then you should include them. For example, if there were a global macroeconomic shock in a year, then year fixed effects would be one way to control for it.

When you implement fixed effects in nonlinear panel models like logit, you shouldn't do it by throwing in dummies for firms and years like you might with OLS. Those estimates are biased if you have too few observations per dummy (you would have 11 observations per firm dummy, which is not enough). Instead, use the conditional logit fixed effects estimator, which should be implemented in newer versions of statistical software. In Stata, you can do this via

xtset firmid year
xtlogit depvar x1 x2 x3, fe

In short, you should use firm fixed effects if you believe you have not included essential time invariant explanatory variables. Fixed effects will control for those time invariant factors. You should not use fixed effects if you want to estimate the effect of particular time invariant factors. You could not estimate those coefficients jointly with fixed effects. In most cases, fixed effects make your regression more robust, and that's why most economists use fixed effects.

Last point: are you estimating bankruptcy probabilities of firms? It seems to me that survival models, rather than binary choice models, would be more appropriate for this question.

Bootstrap Methods – Bootstrapping Hierarchical/Multilevel Data (Resampling Clusters)

Resampling whole clusters has been known in survey statistics for as long as any resampling methods have been used there at all (that is, since the mid-1960s), so it is a well-established method. See my collection of links at http://www.citeulike.org/user/ctacmo/tag/survey_resampling. Whether boot can do this or not, I don't know; I use the survey package when I need to work with survey bootstraps, although the last time I checked, it did not have all the functionality I needed (such as some small-sample corrections, as far as I can recall).
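For readers working in Stata, the language used elsewhere on this page, resampling whole clusters can be sketched with the bootstrap prefix. This is only an illustration under assumed names (depvar, x1, x2, clusterid); the number of replications and the seed are arbitrary, and it is not the survey-bootstrap machinery referred to above.

```stata
* Hypothetical names: depvar outcome, x1 x2 covariates, clusterid cluster identifier.
* cluster() makes bootstrap draw entire clusters with replacement
* instead of individual observations.
bootstrap _b, reps(200) seed(12345) cluster(clusterid): regress depvar x1 x2
```

If the model itself uses the cluster identifier (for example a fixed effects estimator), the idcluster() option of bootstrap can create a new identifier so that a cluster drawn twice is treated as two distinct clusters.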

I don't think applying a particular model such as fixed effects changes things much, but, IMO, the residual bootstrap makes a lot of strong assumptions (the residuals are i.i.d., the model is correctly specified). Every one of them is easily broken, and the cluster structure surely breaks the i.i.d. assumption.

There has been some econometrics literature on the wild cluster bootstrap. The authors write as if they worked in a vacuum, without those fifty years of survey statistics research into the topic, so I am not sure what to make of it.
