Solved – Difference in Differences estimation in a log linear model

difference-in-difference, econometrics

Suppose I want to estimate a classical difference-in-differences model:

$y = \alpha + \beta*T+\gamma*A+\theta*T*A+\epsilon$

where:

$\mathbb{E}[\epsilon|T,A]=0$, $T\in \left\{0,1\right\}$, $A\in \left\{0,1\right\}$

For concreteness let's say that:

$y$ is the log of the number of firms;

$T$ is equal to 1 after the reform;

$A$ is equal to 1 for the industries affected by the reform.

Suppose I estimate the following regression separately for $A\in\left\{0,1\right\}$:

$y = \alpha + \beta*T+\epsilon$

The estimated coefficient $\hat{\beta}$ would then be interpreted as the percentage increase in the number of firms after the reform (here I am assuming that $\hat{\beta}$ is small, so that the Taylor series approximation $\ln(1+x) \approx x$ behind this interpretation is accurate).
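To see how good this approximation is, here is a minimal sketch comparing the usual "100 times the coefficient" reading with the exact percent change $(e^{\beta}-1)\cdot 100$. The coefficient values and the helper name `exact_percent` are hypothetical and chosen only for illustration.

```python
import math

def exact_percent(beta):
    """Exact percent change implied by a log-point change of beta."""
    return (math.exp(beta) - 1) * 100

# Hypothetical coefficient values, not taken from any data.
for beta in (0.01, 0.05, 0.10, 0.30):
    approx = beta * 100  # the "100 * beta" shortcut
    print(f"beta={beta:.2f}  approx={approx:.2f}%  exact={exact_percent(beta):.2f}%")
```

The two readings agree closely for small coefficients and drift apart as the coefficient grows.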

The difference-in-differences estimate ($\hat{\theta}$) is in essence the difference between these two estimates for the treatment and control groups. It is therefore a difference of percentages and should, I think, be interpreted as a percentage point difference. Is this correct? I am asking because I have repeatedly encountered, in papers using log-linear models, the interpretation "the treatment effect is X%" rather than "X percentage points" when the dependent variable is in logs.

Best Answer

The treatment effect in your case is in percent and not in percentage points. To see this, consider the following simple example with two groups and two time periods for before and after the intervention. Let the group means be

  y   time    group         lny
110      0        0   4.7004805
111      1        0   4.7095304
110      0        1   4.7004805
115      1        1   4.7449322

where $y$ is the outcome, time indexes the periods (0 before, 1 after treatment), and group indexes the control (0) and treatment (1) group, and $\ln y$ is the log transformed outcome.

Suppose that the outcome trends for the two groups were the same before the treatment. Then the difference-in-differences estimate would be $$\beta_{y,DiD} = [115 - 110] - [111 - 110] = 4 $$

so the treatment group increased its outcome by 4 units more than the control group (which is the same as saying that the treatment group's outcome is 4 units higher than in the counterfactual in which the treatment had not happened).
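The arithmetic above can be reproduced with a short sketch. The dictionary layout and the helper name `did` are just illustrative choices; the numbers are the group means from the table.

```python
# Group means of y from the table, keyed by (time, group).
means = {
    (0, 0): 110.0,  # control, before
    (1, 0): 111.0,  # control, after
    (0, 1): 110.0,  # treatment, before
    (1, 1): 115.0,  # treatment, after
}

def did(m):
    """Change over time in the treated group minus change in the control group."""
    return (m[(1, 1)] - m[(0, 1)]) - (m[(1, 0)] - m[(0, 0)])

did_levels = did(means)
print(did_levels)
```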

When you compare the growth rates in each group, you get a percent increase of $\frac{111-110}{110}\cdot 100 = 0.91$% in the control group and $\frac{115-110}{110}\cdot 100 = 4.55$% in the treatment group. These are approximately the same as the differences in logs between periods 1 and 0 in each group (the approximation works well because the changes are not too big). So the difference-in-differences of the growth rates is 4.55% - 0.91% ≈ 3.6%. The DiD estimate using the log transformed outcome is $$\beta_{\ln y,DiD} = [4.7449322 - 4.7004805] - [4.7095304 - 4.7004805] = 0.0354 $$

and multiplying this by 100 gives about 3.54, which is approximately the difference we had before. The exact result is given by $(\exp(0.0354018)-1)\cdot 100 = 3.6$.

In the absence of the treatment, the treated group would have had 111 in the second period instead of 115. The percentage change from 111 to 115 is $\frac{115 - 111}{111} \cdot 100 = 3.6$%, which is exactly what the log DiD coefficient recovers.
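As a check on these numbers, the following sketch computes the log DiD from the table's group means, converts it to an exact percent effect, and compares it with the percent change relative to the counterfactual value of 111. Variable names are illustrative only.

```python
import math

# Group means of y from the table.
control_before, control_after = 110.0, 111.0
treated_before, treated_after = 110.0, 115.0

# DiD on the log scale.
log_did = (math.log(treated_after) - math.log(treated_before)) - (
    math.log(control_after) - math.log(control_before)
)

# Exact percent effect implied by the log DiD coefficient.
exact_pct = (math.exp(log_did) - 1) * 100

# Percent change of the treated outcome relative to its counterfactual (111).
counterfactual_pct = (treated_after - control_after) / control_after * 100

print(round(log_did, 4), round(exact_pct, 2), round(counterfactual_pct, 2))
```

Because both groups start at 110 in this example, the log DiD reduces to $\ln(115/111)$, which is why the two percent figures coincide.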

So the papers that you were reading are right. A DiD coefficient measures a percentage point increase when the outcome itself is on a percent scale, i.e. $y \in [0,100]$. Regressing such an outcome on the time and group indicators and their interaction will yield a DiD coefficient that shows a percentage point increase for the treatment group. Your other point is correct: regressing $$y = \alpha + \beta T + \epsilon$$ in each group separately gives the change over time in the outcome for each group. If $\beta_1$ and $\beta_0$ denote the resulting coefficients for the treatment and control groups, then $\beta_1 - \beta_0$ will again give you the DiD estimate.
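The equivalence between the interaction coefficient and the difference of the group-specific slopes can be illustrated with a minimal sketch. It assumes NumPy is available and, for simplicity, fits the regressions to the four group means from the table rather than to firm-level data, so the fit is exact and no standard errors are produced.

```python
import numpy as np

# One row per (time, group) cell, using the group means from the table.
T = np.array([0.0, 1.0, 0.0, 1.0])   # 1 = after the treatment
A = np.array([0.0, 0.0, 1.0, 1.0])   # 1 = treatment group
lny = np.log(np.array([110.0, 111.0, 110.0, 115.0]))

# Pooled model: ln y = alpha + beta*T + gamma*A + theta*T*A
X = np.column_stack([np.ones(4), T, A, T * A])
alpha, beta, gamma, theta = np.linalg.lstsq(X, lny, rcond=None)[0]

def slope_on_time(group):
    """Coefficient on T from regressing ln y on a constant and T within one group."""
    mask = A == group
    Xg = np.column_stack([np.ones(mask.sum()), T[mask]])
    return np.linalg.lstsq(Xg, lny[mask], rcond=None)[0][1]

beta_0 = slope_on_time(0)  # control group
beta_1 = slope_on_time(1)  # treatment group

print(theta, beta_1 - beta_0)
```

Both printed values should agree with the log DiD estimate of about 0.0354 computed by hand.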