Difference-in-Differences Analysis on Aggregate Data

causality, difference-in-difference

I am currently working on a DID analysis on the causal effect of a vacancy tax on the breaking and entering rate in the city.

The purpose of the tax is to create an incentive for people to put empty and under-utilized homes onto the market. It taxes homes that are empty for more than 6 months in the reference year.

My theoretical framework consists of social disorganization theory and broken windows theory. The reasoning is that more empty homes mean more opportunities for crime, so fewer empty homes mean fewer opportunities for crime.

The tax (treatment) took effect right away at the beginning of 2017.

I have data on the number of breaking-and-entering incidents for the treatment city and a control city from 2014 to 2017 (3 pre-treatment periods and 1 post-treatment period). I can get more pre-treatment data if needed, but I believe 3 should be enough to assess the parallel trends assumption.

Can I use my current data to do a difference in differences analysis? I am new to DID and I am not quite sure how to create an appropriate equation.

Thank you in advance for your help!

Best Answer

You have enough data to estimate the DID effect, but too few data points to calculate standard errors correctly or to probe parallel trends: you have eight data points (really just two units, each observed 4 times) and you are estimating 4 parameters. Because you cannot quantify the uncertainty of the estimated effect, your ability to generalize this result to other places will be very limited. You can, however, compute the effect by hand rather than using regression.
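For reference, the four parameters are those of the usual two-group, two-period DID regression. The notation below is my own labeling (it does not come from the question), with $Treat_c$ marking the taxed city and $Post_t$ marking 2017:

$$ y_{ct} = \alpha + \gamma \, Treat_c + \lambda \, Post_t + \delta \, (Treat_c \times Post_t) + \epsilon_{ct} $$

Because this model is saturated in the four treated-by-post cells, the OLS estimate of $\delta$ is just a difference of cell means, which is why it can be computed by hand:

$$ \hat{\delta} = \left( \bar{y}_{T,post} - \bar{y}_{T,pre} \right) - \left( \bar{y}_{C,post} - \bar{y}_{C,pre} \right) $$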

Here is an example using a dataset on the weights of two pigs:

. webuse pig, clear
(Longitudinal analysis of pig weights)

. keep if inlist(id,1,2)
(414 observations deleted)

. keep if week <= 4
(10 observations deleted)

. xtset id week 

Panel variable: id (strongly balanced)
 Time variable: week, 1 to 4
         Delta: 1 unit

. gen treated = id == 2

. gen post    = week == 4

. list, clean noobs

    id   week   weight   treated   post  
     1      1       24         0      0  
     1      2       32         0      0  
     1      3       39         0      0  
     1      4     42.5         0      1  
     2      1     22.5         1      0  
     2      2     30.5         1      0  
     2      3     40.5         1      0  
     2      4       45         1      1  

. diff weight, period(post) treated(treated) cluster(id)

DIFFERENCE-IN-DIFFERENCES ESTIMATION RESULTS
Number of observations in the DIFF-IN-DIFF: 8
            Before         After    
   Control: 3              1           4
   Treated: 3              1           4
            6              2
--------------------------------------------------------
 Outcome var.   | weight  | S. Err. |   |t|   |  P>|t|
----------------+---------+---------+---------+---------
Before          |         |         |         | 
   Control      | 31.667  |         |         | 
   Treated      | 31.167  |         |         | 
   Diff (T-C)   | -0.500  |    .    |    .    |    .
After           |         |         |         | 
   Control      | 42.500  |         |         | 
   Treated      | 45.000  |         |         | 
   Diff (T-C)   | 2.500   |    .    |    .    |    .
                |         |         |         | 
Diff-in-Diff    | 3.000   |    .    |    .    |    .
--------------------------------------------------------
R-square:    0.46
* Means and Standard Errors are estimated by linear regression
**Clustered Std. Errors
**Inference: *** p<0.01; ** p<0.05; * p<0.1

. xtreg weight i.post##i.treated, fe vce(cluster id)
note: 1.treated omitted because of collinearity.

Fixed-effects (within) regression               Number of obs     =          8
Group variable: id                              Number of groups  =          2

R-squared:                                      Obs per group:
     Within  = 0.4568                                         min =          4
     Between = 1.0000                                         avg =        4.0
     Overall = 0.4560                                         max =          4

                                                F(0,1)            =          .
corr(u_i, Xb) = -0.0695                         Prob > F          =          .

                                     (Std. err. adjusted for 2 clusters in id)
------------------------------------------------------------------------------
             |               Robust
      weight | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      1.post |   10.83333          .        .       .            .           .
   1.treated |          0  (omitted)
             |
post#treated |
        1 1  |          3   1.40e-15  2.1e+15   0.000            3           3
             |
       _cons |   31.41667          .        .       .            .           .
-------------+----------------------------------------------------------------
     sigma_u |  .35355339
     sigma_e |  8.2965856
         rho |  .00181269   (fraction of variance due to u_i)
------------------------------------------------------------------------------
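The hand calculation mentioned earlier can be sketched in a few lines. This is a minimal illustration in Python rather than Stata, using the eight pig-weight observations from the listing above; the helper name `cell_mean` is my own.

```python
# Pig weights copied from the listing above: (id, week, weight, treated, post).
rows = [
    (1, 1, 24.0, 0, 0), (1, 2, 32.0, 0, 0), (1, 3, 39.0, 0, 0), (1, 4, 42.5, 0, 1),
    (2, 1, 22.5, 1, 0), (2, 2, 30.5, 1, 0), (2, 3, 40.5, 1, 0), (2, 4, 45.0, 1, 1),
]


def cell_mean(treated, post):
    """Average weight in one of the four treated-by-post cells."""
    values = [w for (_id, _week, w, tr, po) in rows if tr == treated and po == post]
    return sum(values) / len(values)


diff_before = cell_mean(1, 0) - cell_mean(0, 0)  # treated minus control, pre-period
diff_after = cell_mean(1, 1) - cell_mean(0, 1)   # treated minus control, post-period
did = diff_after - diff_before

print(f"Diff before: {diff_before:.3f}")
print(f"Diff after:  {diff_after:.3f}")
print(f"DID:         {did:.3f}")
```

The three printed numbers should match the "Diff (T-C)" and "Diff-in-Diff" rows of the table above. Note that nothing here produces a standard error, which is the point of the caveat about inference.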

Other suggestions:

You may want to normalize B&E by population if your two cities are of unequal size.
If you can get more cities and additional pre-treatment data, you could use a synthetic control approach, which will work a lot better than DID with a single treated unit.
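The population normalization suggested above amounts to converting counts into rates. Here is a minimal Python sketch; the city names, counts, and populations are made-up numbers for illustration only, not data from the question.

```python
# Hypothetical counts and populations, for illustration only.
burglaries = {"treated_city": 4200, "control_city": 1500}
population = {"treated_city": 650_000, "control_city": 250_000}


def rate_per_100k(count, pop):
    """Incidents per 100,000 residents."""
    return count / pop * 100_000


rates = {city: rate_per_100k(burglaries[city], population[city]) for city in burglaries}
for city, rate in rates.items():
    print(f"{city}: {rate:.1f} break-ins per 100,000 residents")
```

In practice you would apply this to every city-year and run the DID on the rates, using a population figure for each year if one is available.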

Related Solutions

Econometrics – Difference-in-Differences with Individual Level Panel Data

A nice feature of difference-in-differences (DiD) is that you don't actually need panel data for it. Given that the treatment happens at some level of aggregation (in your case, cities), you only need to sample random individuals from the cities before and after the treatment. This allows you to estimate $$ y_{ist} = A_g + B_t + \beta D_{st} + c X_{ist} + \epsilon_{ist} $$ where $A_g$ is the treatment-group indicator, $B_t$ captures time effects, $D_{st}$ indicates treated cities in the post-treatment period, and $X_{ist}$ are individual-level controls. The coefficient $\beta$ gives the causal effect of the treatment as the expected post-pre outcome difference for the treated minus the expected post-pre outcome difference for the control.

There is a case in which people use individual fixed effects instead of a treatment indicator and this is when we don't have a well-defined level of aggregation at which the treatment occurs. In that case you would estimate $$ y_{it} = \alpha_i + B_t + \beta D_{it} + cX_{it}+\epsilon_{it} $$ where $D_{it}$ is an indicator for the post-treatment period for individuals who received the treatment (for example, a job market program which happens all over the place). For more information on this see these lecture notes by Steve Pischke.

In your setting, adding individual fixed effects should not change anything with respect to the point estimates. The treatment indicator $A_g$ will just be absorbed by the individual fixed effects. However, these fixed effects might soak up some of the residual variance and therefore potentially reduce the standard error of your DiD coefficient.

Here is a code example which shows that this is the case. I use Stata but you can replicate this in the statistical package of your choice. The "individuals" here are actually countries but they are still grouped according to some treatment indicator.

* load the data set (requires an internet connection)
use "http://dss.princeton.edu/training/Panel101.dta"

* generate the time and treatment group indicators and their interaction
gen time = (year>=1994) & !missing(year)
gen treated = (country>4) & !missing(country)
gen did = time*treated

* do the standard DiD regression
reg y_bin time treated did

------------------------------------------------------------------------------
       y_bin |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        time |       .375   .1212795     3.09   0.003     .1328576    .6171424
     treated |   .4166667   .1434998     2.90   0.005       .13016    .7031734
         did |  -.4027778   .1852575    -2.17   0.033    -.7726563   -.0328992
       _cons |         .5   .0939427     5.32   0.000     .3124373    .6875627
------------------------------------------------------------------------------

 * now repeat the same regression but also including country fixed effects
 areg y_bin did time treated, a(country)

------------------------------------------------------------------------------
       y_bin |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        time |       .375    .120084     3.12   0.003     .1348773    .6151227
     treated |          0  (omitted)
         did |  -.4027778   .1834313    -2.20   0.032    -.7695713   -.0359843
       _cons |   .6785714    .070314     9.65   0.000       .53797    .8191729
-------------+----------------------------------------------------------------

So you see that the DiD coefficient remains the same when the individual fixed effects are included (areg is one of the available fixed effects estimation commands in Stata). The standard errors are slightly tighter and our original treatment indicator was absorbed by the individual fixed effects and therefore dropped in the regression.
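The same point can be checked without Stata. The sketch below is a self-contained Python illustration on a small hypothetical balanced panel (the numbers are invented, and `ols` is my own helper that solves the normal equations directly). It only compares point estimates and says nothing about standard errors.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical balanced panel: unit -> (treated flag, outcomes in periods 1..4).
panel = {
    "A": (0, [10, 11, 12, 13]),
    "B": (0, [20, 22, 23, 25]),
    "C": (1, [15, 16, 20, 21]),
    "D": (1, [30, 31, 36, 37]),
}
units = sorted(panel)

X_pooled, X_fe, y = [], [], []
for unit in units:
    treated, outcomes = panel[unit]
    for period, value in enumerate(outcomes, start=1):
        post = 1 if period >= 3 else 0  # treatment starts in period 3
        did = post * treated
        # Pooled DiD: constant, post, treated, interaction.
        X_pooled.append([1, post, treated, did])
        # Unit fixed effects: dummies for all units but the first replace `treated`.
        dummies = [1 if unit == u else 0 for u in units[1:]]
        X_fe.append([1, post] + dummies + [did])
        y.append(value)

did_pooled = ols(X_pooled, y)[-1]
did_fe = ols(X_fe, y)[-1]
print(f"pooled DiD = {did_pooled:.4f}, fixed-effects DiD = {did_fe:.4f}")
```

Both coefficients come out the same because the unit dummies span the treatment-group indicator. The equality relies on the panel being balanced; with unbalanced data the two estimates can differ.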

In response to the comment
I mentioned the Pischke example to show when people use individual fixed effects rather than a treatment group indicator. Your setting has a well-defined group structure, so the way you have written your model is perfectly fine. Standard errors should be clustered at the city level, i.e. the level of aggregation at which the treatment occurs (I haven't done this in the example code, but in DiD settings the standard errors need to be corrected, as demonstrated by the Bertrand et al. paper).

Regarding the movers, they don't have much of a role to play here. The treatment indicator $D_{st}$ is equal to 1 for people who live in a treated city $s$ in the post-treatment period $t$. To compute the DiD coefficient, we actually just need to compute four conditional expectations, namely $$ \beta = \left[ E(y_{ist}|s=1,t=1) - E(y_{ist}|s=1,t=0)\right] - \left[ E(y_{ist}|s=0,t=1) - E(y_{ist}|s=0,t=0)\right] $$

So if you have 4 post-treatment periods for an individual who lives in a treated city for the first two, and then moves to a control city for the remaining two periods, the first two of those observations will be used in the computation of $E(y_{ist}|s=1,t=1)$ and the last two in $E(y_{ist}|s=0,t=1)$. To make it clear why identification comes from the group differences over time and not from the movers you can visualize this with a simple graph. Suppose the change in the outcome is truly only because of the treatment and that it has a contemporaneous effect. If we have an individual who lives in a treated city after the treatment starts but then moves to a control city, their outcome should go back to what it was before they were treated. This is shown in the stylized graph below.
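To make the bookkeeping for movers concrete, here is a small Python sketch with invented person-period records. Person "m" is the hypothetical mover described above: two post-treatment periods in a treated city, then two in a control city. Observations are assigned to cells by where the person lives in that period, not by who the person is.

```python
from collections import defaultdict

# Hypothetical person-period records: (person, s, t, y),
# where s = 1 for a treated city and t = 1 for a post-treatment period.
records = [
    ("a", 1, 0, 10), ("a", 1, 0, 12), ("a", 1, 1, 15), ("a", 1, 1, 17),
    ("b", 0, 0, 8), ("b", 0, 0, 10), ("b", 0, 1, 9), ("b", 0, 1, 11),
    # The mover: two post-periods in the treated city, then two in the control city.
    ("m", 1, 1, 16), ("m", 1, 1, 16), ("m", 0, 1, 10), ("m", 0, 1, 10),
]

cells = defaultdict(list)       # (s, t) -> outcomes, regardless of who they belong to
mover_cells = defaultdict(int)  # (s, t) -> number of the mover's observations
for person, s, t, y in records:
    cells[(s, t)].append(y)
    if person == "m":
        mover_cells[(s, t)] += 1

mean = {cell: sum(v) / len(v) for cell, v in cells.items()}
did = (mean[(1, 1)] - mean[(1, 0)]) - (mean[(0, 1)] - mean[(0, 0)])

print("mover observations per cell:", dict(mover_cells))
print("DiD from the four cell means:", did)
```

The mover contributes two observations to the treated post-treatment cell and two to the control post-treatment cell, exactly as the paragraph above describes. The DiD itself is built only from the four cell means.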

You might still want to think about movers for other reasons though. For instance, the treatment may have a lasting effect (i.e. it still affects the outcome even after the individual has moved).

Econometrics – Conducting Difference-in-Differences Analysis with Continuous Treatment Variables

Your approach seems valid.

From my quick reading of the Stata manual, -areg- maps well to -didregress-. You're assessing the effect of the Venezuelan migration share on the mean wage of Peruvians at the city level. From my review of the code, you're estimating the following:

$$ \ln(wage_{ict}) = \gamma_c + \lambda_t + \delta MS_{ct} + \epsilon_{ict}, $$

where you observe the mean (logged) wage of survey respondent $i$ within city $c$ and in year $t$. The parameters $\gamma_c$ and $\lambda_t$ denote fixed effects for cities and years, respectively. The treatment variable $MS_{ct}$ is the Venezuelan migration share, which is expressed as a proportion. Since the share varies across cities and is increasing over time in the post-treatment years, it is both $c$- and $t$-subscripted. To be clear, $MS_{ct}$ is 0 in the pre-migration years, then takes on positive values from 2017 onward. Note that this is simply the generalized difference-in-differences equation with a continuous treatment variable. Instead of a discretized version of treatment (i.e., 0/1 for pre-/post-treatment), the variable now represents the different gradations of exposure to migrants across cities and over time.

I would like to know if there is a different way to create a percentage variable.

First, I recommend working in log wages, which should ease the interpretation of $\delta$. Second, try leaving the variable as is. But be careful, as you're working with values bounded between 0 and 1. Say you observe a coefficient of .52 on the exposure variable. You should think about what a meaningful interpretation might be in practice. If you increase the share of migrants by .01, or 1 percentage point, then this increases average wages by about .0052, which is approximately .5 percent.

Please note that I multiplied the coefficient by .01 before assessing it in percentage terms. Again, the immigrant share is expressed as a proportion. The traditional interpretation of a "one-unit" increase (here, a jump in the share from 0 to 1) has no well-defined meaning in this context. You didn't enter the exposure variable into the model as a series of values from 0–100, which would make statements about "percentage point increases" a bit more palatable. To overcome this, multiply the coefficient by .01 first before giving it the useful percentage interpretation.

Now say you're more interested in the effect of migration on wages in 10 percentage point jumps. Well, if you increase the share of migrants by .1, or 10 percentage points, then we should expect mean wages to increase by .052, which is approximately 5 percent. The latter interpretation sounds nice, but you should decide which is more meaningful in the context of overall migration patterns.
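The arithmetic behind both interpretations can be written out as a short Python sketch. It assumes the outcome is in logs and uses the illustrative coefficient of .52 from above; the function name `wage_effect` is my own. It also reports the exact percent change, $100 \times (e^{\delta \Delta} - 1)$, next to the log-point approximation.

```python
import math

delta = 0.52  # coefficient on the migrant share (a proportion), as in the example above


def wage_effect(share_change):
    """Return (log-point change, exact percent change) in mean wages."""
    log_change = delta * share_change
    return log_change, (math.exp(log_change) - 1) * 100


one_pp = wage_effect(0.01)  # 1 percentage point increase in the share
ten_pp = wage_effect(0.10)  # 10 percentage point increase in the share

print(f"+1 pp:  {one_pp[0]:.4f} log points, {one_pp[1]:.2f}% exact")
print(f"+10 pp: {ten_pp[0]:.4f} log points, {ten_pp[1]:.2f}% exact")
```

For small changes the log-point figure and the exact percent change are nearly identical. For the 10 percentage point jump the exact figure is slightly above 5 percent, so the "approximately 5 percent" reading is a mild understatement.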

On the first try I did not set the values to 0 in the legal migrant share variable for the pre-treatment periods, so the regression $p$-value indicated a statistically significant effect of the treatment coefficient, which did not happen when I set the legal migrant share for the pre-treatment time.

If the values of the exposure variable were not 0 before the first wave of Venezuelans arrived, then what were they? Was there a constant share of non-natives before 2017? Did you exclude all pre-treatment survey waves, which would suggest why you're missing nearly half the observations in the first table? I'm not sure I can address all of these concerns without more information, but what I can say is that it's permissible to have zero values for the exposure variable pre-policy epoch. It seems reasonable, at least to me, that the share of migrants would be 0 before the actual refugee crisis. To acknowledge the comment by @mdewey, I just hope you're not doing this due to a lack of reliable migration data before the exposure.

And lastly, you should be working with repeated cross-sections before and after the migration crisis. You should not be manipulating the pre-exposure migration share to achieve a strong, desirable, or "cool" result. The value of the exposure variable should match the reality of the actual observed legal migration share in that particular city and year.
