Difference-in-Differences Analysis on Aggregate Data

causality, difference-in-difference

I am currently working on a DID analysis of the causal effect of a vacancy tax on a city's breaking-and-entering rate.

The purpose of the tax is to create an incentive for people to put empty and under-utilized homes onto the market. It taxes homes that are empty for more than 6 months in the reference year.

My theoretical framework draws on social disorganization theory and broken windows theory. The reasoning is that more empty homes mean more opportunities for crime, so fewer empty homes mean fewer opportunities.

The tax (treatment) took effect right away at the beginning of 2017.

I have annual counts of breaking-and-entering incidents for the treatment city and the control city from 2014 to 2017 (3 pre-treatment periods and 1 post-treatment period). I can get more pre-treatment data if needed, but I believe 3 periods should be enough to assess the parallel trends assumption.

Can I use my current data to do a difference-in-differences analysis? I am new to DID and am not quite sure how to write an appropriate equation.

Thank you in advance for your help!

Best Answer

You have enough data to estimate the DID effect, but too few data points to calculate standard errors correctly or to probe parallel trends: you have eight data points (really just two units, each observed 4 times) and you are estimating 4 parameters. Because you cannot quantify the uncertainty of the estimated effect, your ability to extend this result to other places will be very limited. You can still compute the point estimate by hand rather than with regression.
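
For reference, the 4 parameters correspond to the standard two-group DID regression. The notation below is a sketch of the usual textbook form, not something taken from the question: $Y_{ct}$ stands for the B&E outcome in city $c$ and year $t$, $\mathrm{Treated}_c$ equals 1 for the city with the vacancy tax, and $\mathrm{Post}_t$ equals 1 for 2017.

```latex
Y_{ct} = \beta_0 + \beta_1\,\mathrm{Treated}_c + \beta_2\,\mathrm{Post}_t
       + \beta_3\,(\mathrm{Treated}_c \times \mathrm{Post}_t) + \varepsilon_{ct}

\hat{\beta}_3 = \left(\bar{Y}_{T,\mathrm{post}} - \bar{Y}_{T,\mathrm{pre}}\right)
              - \left(\bar{Y}_{C,\mathrm{post}} - \bar{Y}_{C,\mathrm{pre}}\right)
```

With only these dummies on the right-hand side, $\hat{\beta}_3$ is a difference of four cell means, which is why no regression software is needed to obtain it.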

Here is an example using a dataset on the weights of two pigs, with pig 2 as the treated unit and week 4 as the only post-treatment period:

. webuse pig, clear
(Longitudinal analysis of pig weights)

. keep if inlist(id,1,2)
(414 observations deleted)

. keep if week <= 4
(10 observations deleted)

. xtset id week 

Panel variable: id (strongly balanced)
 Time variable: week, 1 to 4
         Delta: 1 unit

. gen treated = id == 2

. gen post    = week == 4

. list, clean noobs

    id   week   weight   treated   post  
     1      1       24         0      0  
     1      2       32         0      0  
     1      3       39         0      0  
     1      4     42.5         0      1  
     2      1     22.5         1      0  
     2      2     30.5         1      0  
     2      3     40.5         1      0  
     2      4       45         1      1  

. diff weight, period(post) treated(treated) cluster(id)

DIFFERENCE-IN-DIFFERENCES ESTIMATION RESULTS
Number of observations in the DIFF-IN-DIFF: 8
            Before         After    
   Control: 3              1           4
   Treated: 3              1           4
            6              2
--------------------------------------------------------
 Outcome var.   | weight  | S. Err. |   |t|   |  P>|t|
----------------+---------+---------+---------+---------
Before          |         |         |         | 
   Control      | 31.667  |         |         | 
   Treated      | 31.167  |         |         | 
   Diff (T-C)   | -0.500  |    .    |    .    |    .
After           |         |         |         | 
   Control      | 42.500  |         |         | 
   Treated      | 45.000  |         |         | 
   Diff (T-C)   | 2.500   |    .    |    .    |    .
                |         |         |         | 
Diff-in-Diff    | 3.000   |    .    |    .    |    .
--------------------------------------------------------
R-square:    0.46
* Means and Standard Errors are estimated by linear regression
**Clustered Std. Errors
**Inference: *** p<0.01; ** p<0.05; * p<0.1

. xtreg weight i.post##i.treated, fe vce(cluster id)
note: 1.treated omitted because of collinearity.

Fixed-effects (within) regression               Number of obs     =          8
Group variable: id                              Number of groups  =          2

R-squared:                                      Obs per group:
     Within  = 0.4568                                         min =          4
     Between = 1.0000                                         avg =        4.0
     Overall = 0.4560                                         max =          4

                                                F(0,1)            =          .
corr(u_i, Xb) = -0.0695                         Prob > F          =          .

                                     (Std. err. adjusted for 2 clusters in id)
------------------------------------------------------------------------------
             |               Robust
      weight | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      1.post |   10.83333          .        .       .            .           .
   1.treated |          0  (omitted)
             |
post#treated |
        1 1  |          3   1.40e-15  2.1e+15   0.000            3           3
             |
       _cons |   31.41667          .        .       .            .           .
-------------+----------------------------------------------------------------
     sigma_u |  .35355339
     sigma_e |  8.2965856
         rho |  .00181269   (fraction of variance due to u_i)
------------------------------------------------------------------------------
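
The by-hand calculation can be sketched in a few lines. This is a minimal illustration in Python rather than Stata, with the eight pig observations typed in from the listing above; the helper name `cell_mean` is made up for this sketch.

```python
# The eight observations from the listing: (id, week, weight, treated, post).
rows = [
    (1, 1, 24.0, 0, 0),
    (1, 2, 32.0, 0, 0),
    (1, 3, 39.0, 0, 0),
    (1, 4, 42.5, 0, 1),
    (2, 1, 22.5, 1, 0),
    (2, 2, 30.5, 1, 0),
    (2, 3, 40.5, 1, 0),
    (2, 4, 45.0, 1, 1),
]


def cell_mean(data, treated, post):
    """Mean weight for one treated/post cell of the 2x2 table."""
    values = [w for (_id, _week, w, t, p) in data if t == treated and p == post]
    return sum(values) / len(values)


control_before = cell_mean(rows, treated=0, post=0)
control_after = cell_mean(rows, treated=0, post=1)
treated_before = cell_mean(rows, treated=1, post=0)
treated_after = cell_mean(rows, treated=1, post=1)

# DID = change in the treated unit minus change in the control unit.
did = (treated_after - treated_before) - (control_after - control_before)

print(f"Diff (T-C) before: {treated_before - control_before:.3f}")
print(f"Diff (T-C) after:  {treated_after - control_after:.3f}")
print(f"Diff-in-diff:      {did:.3f}")
```

The last line prints 3.000, matching the Diff-in-Diff row in the output above. No standard error is produced, which is the honest answer with two units.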

Other suggestions:

  • You may want to normalize B&E counts by population (for example, incidents per 100,000 residents) if your two cities are of unequal size.
  • If you can get more cities and additional pre-treatment data, you could use a synthetic control approach, which will work a lot better than DID with a single treated unit.
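
Normalizing by population could look like the following minimal sketch. The function name, the counts, and the populations are hypothetical placeholders, not data from the question.

```python
def rate_per_100k(incidents, population):
    """Convert a raw incident count into a rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return incidents * 100_000 / population


# Hypothetical yearly figures for two cities of very different size.
big_city_rate = rate_per_100k(incidents=1200, population=600_000)
small_city_rate = rate_per_100k(incidents=450, population=150_000)

print(f"Big city:   {big_city_rate:.1f} per 100,000")
print(f"Small city: {small_city_rate:.1f} per 100,000")
```

In this made-up example the larger city has more incidents in total but a lower rate, which is the distortion that normalization is meant to remove before running the DID.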