Econometrics – Conducting Difference-in-Differences Analysis with Continuous Treatment Variables

Tags: difference-in-difference, econometrics, interpretation, stata, treatment-effect

I am currently researching the labour market effects of Venezuelan migration in Peru. As a first step, I want to estimate the effect of the recent mass migration on natives' mean wages in the three most populous cities. To do this, I computed yearly mean wages by city from a large set of yearly labour market surveys (cross-sectional data from 2014 to 2019), along with the yearly migrant share of each city's population. The migrant share starts in 2017, so the treatment variable is 0 before 2017 and then increases every year, with a different intensity in each city. Before 2017, when the mass migration started, mean wages in the three cities, which also share cultural demographics, followed parallel trends.

So I tried this code:

didregress (cities_wmean) (legshare_cities, continuous), group(cities) time(year)

cities_wmean: a variable equal to the city's mean wage. Due to previous coding, the value is the same for every respondent within a city.

legshare_cities: the legal migration share, which is a proxy for the real migration. This variable goes from 0 to 1 because Stata does not accept a percentage variable. I would like to know if there is a different way to create a percentage variable.

cities: a categorical variable that groups survey respondents by city.

On the first try, I did not set the legal migrant share to 0 for the pre-treatment periods, and the regression $p$-value indicated a statistically significant treatment coefficient. This did not happen when I set the legal migrant share to 0 for the pre-treatment periods. The following graphs show this:

didregress (cities_wmean_n) (legshare_cities, continuous), group(cities) time(year) aeq

I would like to know if there is something wrong with this difference-in-differences set up, and what would be the meaning of the treatment coefficient if the set up is correct, and any other suggestions?

Best Answer

Your approach seems valid.

From my quick reading of the Stata manual, -areg- maps well to -didregress-. You're assessing the effect of the Venezuelan migration share on the mean wage of Peruvians at the city level. From my review of the code, you're estimating the following:

$$ \ln(wage_{ict}) = \gamma_c + \lambda_t + \delta MS_{ct} + \epsilon_{ict}, $$

where you observe the mean (logged) wage of survey respondent $i$ within city $c$ and in year $t$. The parameters $\gamma_c$ and $\lambda_t$ denote fixed effects for cities and years, respectively. The treatment variable $MS_{ct}$ is the Venezuelan migration share, which is expressed as a proportion. Since the share varies across cities and increases over time in the post-treatment years, it is both $c$- and $t$-subscripted. To be clear, $MS_{ct}$ is 0 in the pre-migration years, then takes on positive values from 2017 onward. Note that this is simply the generalized difference-in-differences equation with a continuous treatment variable. Instead of a discretized version of treatment (i.e., 0/1 for pre-/post-treatment), the variable now represents the different gradations of exposure to migrants across cities and over time.
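
As a rough sketch of how the equation above could be fit with -areg-, assuming a respondent-level wage variable that is called `wage` here purely for illustration (it does not appear in the question), the city effects can be absorbed and the year effects entered as indicators:

```stata
* Assumption: `wage` is a hypothetical respondent-level wage variable.
* cities, year, and legshare_cities are the variables named in the question.
generate ln_wage = ln(wage)

* Two-way fixed effects: city effects absorbed, year effects as indicators.
* The coefficient on legshare_cities plays the role of delta in the equation.
areg ln_wage legshare_cities i.year, absorb(cities) vce(cluster cities)
```

With only three cities, standard errors clustered at the city level rest on very few clusters, so the reported inference should be read with caution.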

"I would like to know if there is a different way to create a percentage variable."

First, I recommend working in log wages, which should ease the interpretation of $\delta$. Second, try leaving the variable as is. But be careful, as you're working with values bounded between 0 and 1. Say you observe a coefficient of .52 on the exposure variable. You should think about what a meaningful interpretation might be in practice. If you increase the share of migrants by .01, or 1 percentage point, then this increases average wages by about .0052, which is approximately .5 percent.

Please note that I multiplied the coefficient by .01 before assessing it in percentage terms. Again, the immigrant share is expressed as a proportion, so the traditional "one-unit" increase would mean moving the share from 0 to 1 (100 percentage points), which has no meaningful interpretation in this context. You didn't enter the exposure variable into the model as a series of values from 0–100, which would make statements about "percentage point increases" a bit more palatable. To overcome this, multiply the coefficient by .01 first before giving it the useful percentage interpretation.
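
If a 0–100 scale is preferred, one option is to rescale the share before estimation. The sketch below assumes a new variable, called `legshare_pct` here only for illustration, and otherwise reuses the command from the question:

```stata
* Hypothetical new variable: the legal migration share on a 0-100 scale.
generate legshare_pct = 100 * legshare_cities

* Same specification as in the question, with the rescaled treatment.
didregress (cities_wmean) (legshare_pct, continuous), group(cities) time(year)
```

The fit is unchanged; the coefficient is simply one hundredth of the coefficient on the proportion, so it can be read directly as the effect of a one percentage point increase.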

Now say you're more interested in the effect of migration on wages in 10 percentage point jumps. Well, if you increase the share of migrants by .1, or 10 percentage points, then we should expect mean wages to increase by .052, which is approximately 5 percent. The latter interpretation sounds nice, but you should decide which is more meaningful in the context of overall migration patterns.
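
To spell out the arithmetic for the illustrative coefficient of .52 (a made-up value, not an estimate from the data), the change in log wages is the coefficient times the change in the share:

$$ \Delta \ln(wage) = \delta \times \Delta MS: \qquad 0.52 \times 0.01 = 0.0052, \qquad 0.52 \times 0.10 = 0.052 $$

The exact percentage change is obtained by exponentiating, which shows that the "approximately" is slightly looser for the larger jump:

$$ e^{0.0052} - 1 \approx 0.0052, \qquad e^{0.052} - 1 \approx 0.0534 $$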

"On the first try I did not set the values to 0 in the legal migrant share variable for the pre-treatment periods, so the regression $p$-value indicated a statistically significant effect of the treatment coefficient, which did not happen when I set the legal migrant share for the pre-treatment time."

If the values of the exposure variable were not 0 before the first wave of Venezuelans arrived, then what were they? Was there a constant share of non-natives before 2017? Did you exclude all pre-treatment survey waves, which would suggest why you're missing nearly half the observations in the first table? I'm not sure I can address all of these concerns without more information, but what I can say is that it's permissible to have zero values for the exposure variable pre-policy epoch. It seems reasonable, at least to me, that the share of migrants would be 0 before the actual refugee crisis. To acknowledge the comment by @mdewey, I just hope you're not doing this due to a lack of reliable migration data before the exposure.
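
One way to check these points is to inspect the exposure variable by city and year before estimating anything. This is only a sketch, using the variable names from the question:

```stata
* Mean legal migration share in each city-year cell; the pre-2017 cells show
* what the variable held before any recoding.
tabulate cities year, summarize(legshare_cities) means

* Observations with a missing share drop out of the estimation sample.
count if missing(legshare_cities)
```

If the pre-2017 cells are missing rather than 0, that alone could explain a large drop in the number of observations used.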

And lastly, you should be working with repeated cross-sections before and after the migration crisis. You should not be manipulating the pre-exposure migration share to achieve a strong, desirable, or "cool" result. The value of the exposure variable should match the reality of the actual observed legal migration share in that particular city and year.

Related Solutions

Difference-in-Differences Analysis on Aggregate Data

You have enough data to estimate the DID effect, but too few data points to calculate standard errors correctly or to probe parallel trends: you have eight data points (really just two units observed 4 times each) and you are estimating 4 parameters. This means your ability to extend this result to other places will be very limited, since you cannot quantify the uncertainty of the estimated effect. But you can get the effect by hand rather than using regression.

Here is an example using a dataset on the weights of two pigs:

. webuse pig, clear
(Longitudinal analysis of pig weights)

. keep if inlist(id,1,2)
(414 observations deleted)

. keep if week <= 4
(10 observations deleted)

. xtset id week 

Panel variable: id (strongly balanced)
 Time variable: week, 1 to 4
         Delta: 1 unit

. gen treated = id == 2

. gen post    = week == 4

. list, clean noobs

    id   week   weight   treated   post  
     1      1       24         0      0  
     1      2       32         0      0  
     1      3       39         0      0  
     1      4     42.5         0      1  
     2      1     22.5         1      0  
     2      2     30.5         1      0  
     2      3     40.5         1      0  
     2      4       45         1      1  

. diff weight, period(post) treated(treated) cluster(id)

DIFFERENCE-IN-DIFFERENCES ESTIMATION RESULTS
Number of observations in the DIFF-IN-DIFF: 8
            Before         After    
   Control: 3              1           4
   Treated: 3              1           4
            6              2
--------------------------------------------------------
 Outcome var.   | weight  | S. Err. |   |t|   |  P>|t|
----------------+---------+---------+---------+---------
Before          |         |         |         | 
   Control      | 31.667  |         |         | 
   Treated      | 31.167  |         |         | 
   Diff (T-C)   | -0.500  |    .    |    .    |    .
After           |         |         |         | 
   Control      | 42.500  |         |         | 
   Treated      | 45.000  |         |         | 
   Diff (T-C)   | 2.500   |    .    |    .    |    .
                |         |         |         | 
Diff-in-Diff    | 3.000   |    .    |    .    |    .
--------------------------------------------------------
R-square:    0.46
* Means and Standard Errors are estimated by linear regression
**Clustered Std. Errors
**Inference: *** p<0.01; ** p<0.05; * p<0.1

. xtreg weight i.post##i.treated, fe vce(cluster id)
note: 1.treated omitted because of collinearity.

Fixed-effects (within) regression               Number of obs     =          8
Group variable: id                              Number of groups  =          2

R-squared:                                      Obs per group:
     Within  = 0.4568                                         min =          4
     Between = 1.0000                                         avg =        4.0
     Overall = 0.4560                                         max =          4

                                                F(0,1)            =          .
corr(u_i, Xb) = -0.0695                         Prob > F          =          .

                                     (Std. err. adjusted for 2 clusters in id)
------------------------------------------------------------------------------
             |               Robust
      weight | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      1.post |   10.83333          .        .       .            .           .
   1.treated |          0  (omitted)
             |
post#treated |
        1 1  |          3   1.40e-15  2.1e+15   0.000            3           3
             |
       _cons |   31.41667          .        .       .            .           .
-------------+----------------------------------------------------------------
     sigma_u |  .35355339
     sigma_e |  8.2965856
         rho |  .00181269   (fraction of variance due to u_i)
------------------------------------------------------------------------------
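
The by-hand calculation can be written out with the group means reported in the diff output above, where $\bar{Y}$ denotes mean weight and the subscripts $T$ and $C$ mark the treated and control pig (this notation is introduced here for illustration only):

$$ \hat{\delta}_{DID} = (\bar{Y}_{T,post} - \bar{Y}_{C,post}) - (\bar{Y}_{T,pre} - \bar{Y}_{C,pre}) = (45.000 - 42.500) - (31.167 - 31.667) = 2.500 - (-0.500) = 3.000 $$

This matches both the Diff-in-Diff row and the post#treated coefficient from xtreg.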

Other suggestions:

You may want to normalize B&E by population if your two cities are of unequal size.
If you can get more cities and additional pre-treatment data, you could use a synthetic control approach, which will work a lot better with a single treated unit than DID.
