R – Analyzing Household Fixed Effects with Cross-Sectional Data Using R

difference-in-difference, r, regression

I'm trying to estimate a difference-in-differences model with pooled cross-sectional data. To test the parallel trends assumption, I'm following the approach in "Common trend assumption" and running

$$ Y_{it} = \text{Household FE} + \text{Time FE} + \sum_{j \neq k} \delta_{j} \, \text{Treated} \cdot \mathbb{1}(t = j) + X'b + \text{error} $$

My question is the following: in order to add household fixed effects with pooled cross-sectional data, should I create a unique 'household' factor for each year and then combine, or do so with the already combined data?

Best Answer

You're estimating a difference-in-differences (DiD) equation. In most DiD settings, the data are 'aggregated up' to a higher level. Here, treatment is at the district level, so it should affect all households within each treated district. I would recommend estimating your model with a full set of district dummies and a full set of year dummies.

Here is what I think you want to estimate:

$$ y_{idt} = \gamma_{d} + \lambda_{t} + \delta D_{dt} + \theta X_{idt} + \epsilon_{idt}, $$

where you observe households $i$ within districts $d$ across years $t$. $\gamma_{d}$ and $\lambda_{t}$ are fixed effects for districts and years, respectively. Your treatment dummy $D_{dt}$ varies at the district-year level. All we need is a sample of households in the relevant districts $d$ in the various years $t$. The intervention (treatment) is well-defined at this higher level of aggregation; it affects all households embedded within districts. The coefficient $\delta$ on $D_{dt}$ is your treatment effect.
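The district-level specification can be sketched numerically. The following is a minimal Python illustration on synthetic data, not the asker's data: the district and year effects, the true values $\delta = 2$ and $\theta = 0.7$, the choice of treated districts and treatment year, and the helper `ols` are all assumptions made for the example. The data are noise-free so the estimate can be checked against the known value, and a different set of households is drawn in every district-year, as in repeated cross-sections.

```python
# Illustrative sketch: DiD on repeated cross-sections with district and
# year dummies. All data are synthetic and noise-free; names are hypothetical.

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

districts = [0, 1, 2, 3]
years = [0, 1, 2, 3]
treated_districts = {2, 3}      # assumed: districts 2 and 3 are treated
first_treated_year = 2          # assumed: treatment starts in year 2
gamma = [1.0, 2.0, 3.0, 4.0]    # district fixed effects
lam = [0.0, 0.5, 1.0, 1.5]      # year fixed effects
delta_true, theta_true = 2.0, 0.7

X, y = [], []
for d in districts:
    for t in years:
        # Treatment dummy D_dt varies only at the district-year level
        D = 1.0 if (d in treated_districts and t >= first_treated_year) else 0.0
        for h in range(5):  # a fresh draw of households in each district-year
            x = ((7 * h + 3 * d + 5 * t) % 11) / 10.0  # household covariate
            row = [1.0]                                             # intercept
            row += [1.0 if d == k else 0.0 for k in districts[1:]]  # district dummies
            row += [1.0 if t == k else 0.0 for k in years[1:]]      # year dummies
            row += [D, x]
            X.append(row)
            y.append(gamma[d] + lam[t] + delta_true * D + theta_true * x)

beta = ols(X, y)
delta_hat, theta_hat = beta[-2], beta[-1]
print(round(delta_hat, 6), round(theta_hat, 6))
```

Because the outcome is generated without noise, the coefficient on `D` should match the assumed $\delta$ up to floating-point error, even though no household is tracked over time. In R, the same specification is commonly written with `lm()` and `factor()` terms for district and year.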

In your question, however, you indicate that you want to estimate household fixed effects. You could certainly estimate a model with fixed effects at the $i$-th level, but it will not yield the same DiD estimate.

The equation you are considering is the following:

$$ y_{it} = \alpha_{i} + \lambda_{t} + \delta D_{it} + \theta X_{it} + \epsilon_{it}, $$

where $\alpha_{i}$ now represents household fixed effects. $\gamma_{d}$ is not included in this specification; it is absorbed by the household fixed effects. There are cases where including the individual (household) effects yields an identical DiD coefficient. I encourage you to review this post for an example of this.
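To see why the district effects are absorbed, note that when households never change district, each district dummy is exactly the sum of the dummies of the households located in it. A small Python check, using a hypothetical mapping of four households to two districts:

```python
# Hypothetical mapping of households to districts (households never move).
household_district = {"H1": "A", "H2": "A", "H3": "B", "H4": "B"}
rows = [("H1", 2018), ("H2", 2018), ("H3", 2018), ("H4", 2018),
        ("H1", 2019), ("H2", 2019), ("H3", 2019), ("H4", 2019)]

households = sorted(household_district)
districts = sorted(set(household_district.values()))

# One dummy column per household and one per district, row by row
hh_dummies = [[1 if h == k else 0 for k in households] for h, _ in rows]
dist_dummies = [[1 if household_district[h] == k else 0 for k in districts]
                for h, _ in rows]

# Rebuild each district dummy as the sum of its households' dummies
rebuilt = [[sum(hd[j] for j, k in enumerate(households)
                if household_district[k] == d) for d in districts]
           for hd in hh_dummies]

absorbed = rebuilt == dist_dummies
print(absorbed)
```

Since the district dummies are exact linear combinations of the household dummies, including both sets would make the design matrix rank-deficient, which is why $\gamma_{d}$ drops out.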

Your setting is different: you do not observe the same households over time. In a particular year $t$ you observe a sample of households $i$ from district $d$. In year $t+1$ you draw a new cross-section of households. Many households are likely to be sampled again the following year, but you indicated in the comments that nearly one-quarter of each cross-section is a completely new subset of households. The set of households you observe therefore changes from year to year, and because of this your estimates will differ. See the second paragraph under Section 1.5 of these lecture notes for more information.

We can think about this more simply. Suppose you sample two households from a treated district in 2018, which I will call H1 and H2. In 2019, you resample households and observe H1 and H3. You repeat this process once more and obtain H1 and H2. Note that H3 is never sampled again. If you included dummies for all unique households, H3 would get a singleton dummy, because it is observed in only one time period. Again, you could estimate this model, but it will not return the same DiD estimate as the former model, where the data are 'aggregated up' to the district level. It also makes assessing parallel trends difficult, as the composition of your treatment group changes over the years.
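The H1/H2/H3 example can be checked by counting how often each household appears. This is a minimal Python sketch; labeling the third round as 2020 is an assumption, since the example only says the process is repeated.

```python
from collections import Counter

# Households sampled from one treated district in three survey rounds.
# The year of the third round (2020) is assumed for illustration.
samples = {2018: ["H1", "H2"], 2019: ["H1", "H3"], 2020: ["H1", "H2"]}

# Number of rounds in which each household is observed
counts = Counter(h for sampled in samples.values() for h in sampled)

# Households observed exactly once would get a singleton dummy
singletons = sorted(h for h, n in counts.items() if n == 1)

print(dict(counts))
print(singletons)
```

A dummy for a household that appears only once fits that observation perfectly, so the observation carries no information about the other coefficients once household fixed effects are included.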

In sum, you could still estimate this model using household fixed effects. If you sampled a large number of households in each year, and most are resampled anyway, then you could restrict your sample to households with repeated observations over the 22-year period. This ensures you are observing the same households pre- and post-intervention. I also recommend clustering your standard errors at the district level!
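The sample restriction described above might look like the following Python sketch. The records, the district label, and the intervention year are all hypothetical; the rule applied is to keep only households observed at least once before and at least once after the intervention.

```python
# Hypothetical records: (household_id, district, year).
records = [
    ("H1", "A", 2018), ("H2", "A", 2018),
    ("H1", "A", 2019), ("H3", "A", 2019),
    ("H1", "A", 2020), ("H2", "A", 2020),
]
intervention_year = 2019  # assumed: treatment starts in 2019

# Households seen before and after the intervention
pre = {h for h, _, yr in records if yr < intervention_year}
post = {h for h, _, yr in records if yr >= intervention_year}
keep = pre & post

# Restricted sample: only households observed in both periods
panel = [r for r in records if r[0] in keep]

print(sorted(keep))
print(len(panel))
```

Note that the restricted sample can still be unbalanced (here H2 is missing in 2019), and that dropping households changes the population the estimate refers to.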