Solved – Dummies, clustered standard errors or both

categorical-data, clustered-standard-errors, regression

Relative novice here. I am running a regression in an observational setting in which Y is the outcome and D is the treatment indicator. Observations are drawn from 3 different geographic groups designated by X. Is the proper approach to:

  1. Regress Y on D and cluster the standard errors by group.
  2. Regress Y on X and D.
  3. Regress Y on X and D and cluster the standard errors by group.

When pursuing option #3 I am seeing much higher statistical significance, and I'm worried that combining the group dummies with clustering in a cross-sectional setting is somehow problematic.

In principle, what are the tradeoffs between the 3 approaches? Which is most likely to offer an unbiased estimate of the effect of the treatment D (assuming other covariates, not included here, are balanced between the 2 groups)?

Best Answer

With clustered standard errors in your situation, you are basically saying that you are happy with the stability of a variance estimate based on three observations (one per cluster), and equally happy to treat 3 as infinity when relying on asymptotic normality for your inference. See sec. 8.2.3 of Mostly Harmless Econometrics: 42 is sort of infinity; there is no freaking way 3 is. Moreover, your approach 3) should have broken down: with more regressors (an intercept, two group dummies, and D) than clusters, you would not have any degrees of freedom left for clustered standard errors.
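One way to see the degrees-of-freedom problem: the middle ("meat") matrix of the cluster-robust sandwich estimator is a sum of one outer product per cluster, so with 3 clusters its rank cannot exceed 3, which is below the 4 parameters of approach 3). Here is a minimal sketch in plain Python with exact fractions. The data in `rows` are made up, and the helpers `design_row`, `solve`, and `rank` are hypothetical names written for this illustration, not taken from the question or from any library.

```python
from fractions import Fraction

# Hypothetical toy data: (group, D, Y); three groups, four observations each.
rows = [
    (0, 0, 1), (0, 1, 3), (0, 0, 2), (0, 1, 5),
    (1, 0, 2), (1, 1, 3), (1, 0, 4), (1, 1, 6),
    (2, 0, 5), (2, 1, 6), (2, 0, 3), (2, 1, 9),
]

def design_row(g, d):
    # Columns: intercept, D, dummy for group 1, dummy for group 2.
    return [Fraction(1), Fraction(d), Fraction(int(g == 1)), Fraction(int(g == 2))]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination in exact arithmetic."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]

def rank(A):
    """Rank of a matrix via row reduction in exact arithmetic."""
    M = [row[:] for row in A]
    rk = 0
    for c in range(len(M[0])):
        p = next((r for r in range(rk, len(M)) if M[r][c] != 0), None)
        if p is None:
            continue
        M[rk], M[p] = M[p], M[rk]
        for r in range(rk + 1, len(M)):
            f = M[r][c] / M[rk][c]
            M[r] = [a - f * v for a, v in zip(M[r], M[rk])]
        rk += 1
    return rk

# OLS for approach 3: Y on D plus group dummies.
X = [design_row(g, d) for g, d, _ in rows]
y = [Fraction(v) for _, _, v in rows]
k = len(X[0])
XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
beta = solve(XtX, Xty)
resid = [yi - sum(b * xj for b, xj in zip(beta, x)) for x, yi in zip(X, y)]

# Per-cluster score vectors: sum over the cluster of x_i * e_i.
groups = sorted({g for g, _, _ in rows})
scores = []
for g in groups:
    s = [Fraction(0)] * k
    for (gi, _, _), x, e in zip(rows, X, resid):
        if gi == g:
            s = [sj + xj * e for sj, xj in zip(s, x)]
    scores.append(s)

# Meat of the cluster-robust sandwich: sum of outer products of the scores.
meat = [[sum(s[i] * s[j] for s in scores) for j in range(k)] for i in range(k)]

print("clusters:", len(groups), "parameters:", k)
print("rank of cluster-robust meat:", rank(meat))
```

On this toy data the rank is even lower than the number of clusters: the group dummies force the residuals to sum to zero within every cluster, so the only score component that can be nonzero is the one for D. The resulting 4×4 variance estimate is badly rank-deficient, which is a reason to distrust any apparent gain in significance.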

The only approach I would buy in your situation is regressing $Y$ on $D$ or $Y$ on $D \times X$. In the latter case, you would want to test both the main effect of the treatment and its interactions with the regions.
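To spell out the interacted regression, here is one way to write it, assuming $X$ is coded as two region dummies $X_2$ and $X_3$ with region 1 as the baseline (a coding choice not stated in the question), and reading $D \times X$ as the full set of main effects plus interactions:

$$Y_i = \beta_0 + \beta_1 D_i + \gamma_2 X_{2i} + \gamma_3 X_{3i} + \delta_2 D_i X_{2i} + \delta_3 D_i X_{3i} + \varepsilon_i$$

Under that coding, $\beta_1$ is the treatment effect in the baseline region, $\beta_1 + \delta_2$ and $\beta_1 + \delta_3$ are the effects in the other two regions, and the joint test of $\delta_2 = \delta_3 = 0$ asks whether the effect differs across regions.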
