R – Analyzing Household Fixed Effects with Cross-Sectional Data Using R

difference-in-difference r regression

I'm trying to estimate a difference-in-differences model with pooled cross-sectional data. To test the parallel trends assumption, I'm following the approach in the "Common trend assumption" post and running

$$ Y_{it} = \text{Household FE} + \text{Time FE} + \sum_{j \neq k} \delta_j \cdot \text{Treated} \cdot I(t = j) + X'b + \text{error} $$

My question is the following: in order to add household fixed effects with pooled cross-sectional data, should I create a unique 'household' factor for each year and then combine, or do so with the already combined data?

Best Answer

You're estimating a difference-in-differences (DiD) equation. In most settings, the data is 'aggregated up' to a higher level. Treatment in this case is at the district level, so it should affect all households within each treated district. I would recommend estimating your model using a full set of dummies for all districts and a full set of dummies for all years.

Here is what I think you want to estimate:

$$ y_{idt} = \gamma_{d} + \lambda_{t} + \delta D_{dt} + \theta X_{idt} + \epsilon_{idt}, $$

where you observe $i$ households within $d$ districts across $t$ years. $\gamma_{d}$ and $\lambda_{t}$ are fixed effects for districts and years, respectively. Your treatment dummy $D_{dt}$ is at the district-year level. All we need is a sample of households in the relevant districts $d$ in the various years $t$. The intervention (treatment) is well-defined at this higher level of aggregation; it affects all households embedded within districts. The coefficient $\delta$ on $D_{dt}$ is your treatment effect.
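As a rough illustration of the district-level specification, here is a minimal sketch in base R on simulated repeated cross-sections. Everything in it is an assumption made for the example: the variable names (`district`, `year`, `D`, `x`, `y`), the number of districts and years, the treatment timing, and the true effect of 2 built into the simulation.

```r
# Simulated repeated cross-sections: a fresh set of households is drawn in each
# district-year cell. All names and parameter values here are hypothetical.
set.seed(42)
n_districts <- 20
years <- 2010:2019
n_per_cell <- 50           # households sampled per district-year
true_delta <- 2            # treatment effect built into the simulation

cells <- expand.grid(district = 1:n_districts, year = years)
dat <- cells[rep(seq_len(nrow(cells)), each = n_per_cell), ]
rownames(dat) <- NULL

# Districts 1-10 are treated from 2015 onward, so D varies at the district-year level
dat$D <- as.integer(dat$district <= 10 & dat$year >= 2015)
dat$x <- rnorm(nrow(dat))

district_fe <- rnorm(n_districts)[dat$district]
year_fe <- 0.3 * (dat$year - min(years))
dat$y <- district_fe + year_fe + true_delta * dat$D + 0.5 * dat$x + rnorm(nrow(dat))

# District and year dummies; no household identifier is needed
fit <- lm(y ~ D + x + factor(district) + factor(year), data = dat)
delta_hat <- unname(coef(fit)["D"])
delta_hat
```

Note that the regression never uses a household identifier, so it makes no difference whether such an ID is built year by year or on the combined data.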

In your question, however, you indicate that you want to estimate household fixed effects. You could certainly estimate a model with fixed effects at the $i$-th level, but it will not yield the same DiD estimate.

The equation you are considering is the following:

$$ y_{it} = \alpha_{i} + \lambda_{t} + \delta D_{it} + \theta X_{it} + \epsilon_{it}, $$

where $\alpha_{i}$ now represents household fixed effects. $\gamma_{d}$ is not included in this specification; it will be absorbed by the household fixed effects. There are occasions where inclusion of the individual (household) effects yields identical DiD coefficients. I encourage you to review this post for an example of this.

Your setting is different. We do not observe the same households over time. In one particular year $t$ you might observe a sample of households $i$ from district $d$. In year $t+1$ you sample a new cross-section of households. Many households will likely be sampled again the following year, but you indicated in the comments that nearly one-quarter of each cross-section is a completely new subset of households. Thus, you are not consistently observing the same households over time. Because of this, your estimates will differ. See the second paragraph under Section 1.5 of these lecture notes for more information.

We can think about this more simply. Suppose you sample two households from a treated district in 2018, which I will call H1 and H2. In 2019, you resample households and observe H1 and H3. You repeat this process yet again and obtain H1 and H2. Note that H3 is never sampled again. If you included dummies for all unique households, then H3 is now a singleton dummy: it is observed in only one time period. Again, you could estimate this model, but it will not return the same DiD estimate as the former model where the data is 'aggregated up' to the district level. It also makes assessing parallel trends difficult, as the composition of your treatment group changes over the years.
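The H1/H2/H3 pattern above can be checked with a few lines of base R. This is only a toy sketch: the data frame and column names are made up, and the third wave is assumed to be 2020, since only 2018 and 2019 are named explicitly.

```r
# The sampling pattern described above; H1, H2, H3 are the hypothetical households
# and the third wave is assumed to be 2020
samp <- data.frame(
  household = c("H1", "H2", "H1", "H3", "H1", "H2"),
  year      = c(2018, 2018, 2019, 2019, 2020, 2020)
)

# Number of waves in which each household appears
n_obs <- table(samp$household)
n_obs

# Households seen only once: their dummy fits that single observation exactly,
# so they contribute no within-household variation
singletons <- names(n_obs)[n_obs == 1]
singletons
```

Running the same count on the real data gives a quick sense of how much of each cross-section would be absorbed by singleton dummies.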

In sum, you could still estimate this model using household fixed effects. If you sampled a large number of households in each year, and most will be resampled anyway, then you could restrict your sample to households where you have repeated observations over the 22-year period. This ensures you are observing the same households pre- and post-intervention. I also recommend clustering at the district level!
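One way to act on this advice is sketched below in base R, on simulated data. The roster, the 0.6 sampling probability, the treatment timing, and the true effect of 1.5 are all assumptions made for the example. The cluster-robust covariance is written out by hand (plain CR0 sandwich, no small-sample correction) only to keep the sketch self-contained; a dedicated package would normally be used instead.

```r
set.seed(1)
n_districts <- 12
years <- 2015:2020
first_treated_year <- 2018

# Hypothetical roster of 30 households per district
roster <- data.frame(household = seq_len(n_districts * 30),
                     district  = rep(seq_len(n_districts), each = 30))
dat <- merge(roster, data.frame(year = years), by = NULL)  # every household-year pair
dat <- dat[runif(nrow(dat)) < 0.6, ]                       # each pair is sampled with prob. 0.6

# Districts 1-6 are treated from 2018 onward; the simulated effect is 1.5
dat$D <- as.integer(dat$district <= 6 & dat$year >= first_treated_year)
hh_effect <- rnorm(nrow(roster))
dat$y <- hh_effect[dat$household] + 1.5 * dat$D + rnorm(nrow(dat))

# Keep households observed at least once before and at least once after treatment starts
seen_pre  <- tapply(dat$year <  first_treated_year, dat$household, any)
seen_post <- tapply(dat$year >= first_treated_year, dat$household, any)
keep_ids  <- as.integer(names(seen_pre)[seen_pre & seen_post])
panel <- dat[dat$household %in% keep_ids, ]

fit <- lm(y ~ D + factor(household) + factor(year), data = panel)

# Cluster-robust (CR0) covariance: bread %*% meat %*% bread
cluster_vcov <- function(fit, cluster) {
  X <- model.matrix(fit)
  u <- residuals(fit)
  bread <- solve(crossprod(X))
  score_sums <- rowsum(X * u, group = cluster)   # sum of x_i * u_i within each cluster
  V <- bread %*% crossprod(score_sums) %*% bread
  dimnames(V) <- list(colnames(X), colnames(X))
  V
}

V <- cluster_vcov(fit, panel$district)
delta_hat <- unname(coef(fit)["D"])
se_D <- sqrt(V["D", "D"])
c(estimate = delta_hat, cluster_se = se_D)
```

With only a dozen clusters, as in this toy example, cluster-robust standard errors can be unreliable, so treat the number as illustrative only.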

Related Solutions

Solved – Difference between fixed effects models in R (plm) and Stata (xtreg)

Welcome to the site, @gwatson! You are right that effect = "twoways" sets up both "individual" and "year" effects.

I tested with the Produc data from the R package plm and found that the main results are the same (see the code and output below). The only apparent difference I found is in the year effects, which comes down to the contrast coding (xtreg sets the first year as the reference, while plm directly estimates the effect for each year).

## R code
library(plm)
data("Produc", package = "plm")
# Two-way (state and year) within estimator; plm() selects the estimator via `model`
zz <- plm(gsp ~ unemp + lag(gsp), data = Produc, index = c("state","year"), model = "within", effect = "twoways")
summary(zz)

## plm output
Coefficients :
            Estimate  Std. Error  t-value  Pr(>|t|)    
unemp    -5.4525e+02  6.8611e+01  -7.9469 7.614e-15 ***
lag(gsp)  1.0125e+00  9.1789e-03 110.3029 < 2.2e-16 ***


## Stata code
use Produc, clear
xtset state year, yearly
xtreg gsp unemp l.gsp i.year, fe

## xtreg output
------------------------------------------------------------------------------
         gsp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       unemp |   -545.246   68.61136    -7.95   0.000    -679.9537   -410.5383
      gsp L1.|   1.012464   .0091789   110.30   0.000     .9944422    1.030485
-------------+----------------------------------------------------------------
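To see why the slope estimates agree even though the year effects are handled differently, here is a small base R sketch on a simulated balanced panel (it uses neither plm nor the Produc data, and all names and values are hypothetical). It compares a regression with explicit state and year dummies, first year as reference, against a regression on two-way demeaned data. The demeaning formula used here is exact only for balanced panels.

```r
set.seed(7)
panel <- expand.grid(state = 1:10, year = 2001:2008)   # hypothetical balanced panel
panel$x <- rnorm(nrow(panel))
state_fe <- rnorm(10)
panel$y <- state_fe[panel$state] + 0.2 * (panel$year - 2001) +
  1.5 * panel$x + rnorm(nrow(panel))

# (a) Year dummies entered explicitly, with the first year as the reference level
fit_lsdv <- lm(y ~ x + factor(state) + factor(year), data = panel)

# (b) Two-way within transformation: sweep out state means and year means
demean2 <- function(v, g1, g2) v - ave(v, g1) - ave(v, g2) + mean(v)
panel$y_w <- demean2(panel$y, panel$state, panel$year)
panel$x_w <- demean2(panel$x, panel$state, panel$year)
fit_within <- lm(y_w ~ x_w - 1, data = panel)

b_lsdv   <- unname(coef(fit_lsdv)["x"])
b_within <- unname(coef(fit_within)["x_w"])
c(lsdv = b_lsdv, within = b_within)
```

The two slope estimates coincide up to floating-point error; only the reporting of the year effects differs between the two parameterizations.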

Solved – Fixed effect model with household level and state level data

This is a fixed effects model. You should probably cluster your standard errors at the state level. I think it is reasonable to assume the unemployment rate is exogenous. Roughly speaking, no single state resident can significantly influence the unemployment rate, while the unemployment rate can have a significant influence on any single resident's behavior. Education, however, could be endogenous, since both BMI and education could be linked to an unobserved motivation factor.

If education is endogenous, then unless $\hat \beta_{edu}$ and $\hat \beta_{ur}$ are completely uncorrelated, $\hat \beta_{ur}$ will be a biased estimate of the causal effect. From here you could do one of the following:

1. Find a REALLY good reason for why education is exogenous (I don't know if this is possible).
2. Include other covariates to control for unobserved confounders: male/female indicators, mother's education, father's education, income, etc.
3. Find a good instrument for education. Though it's outdated, Angrist and Krueger (1991) use quarter of birth to instrument education. Labor economists have both criticized and refined this instrument, but it's a start.
4. Construct some sort of structural equation, such as a simultaneous system, to account for the endogeneity of both BMI and education.

Overall, unless you are trying to publish something, I would just go with (2) from above.
