Econometrics – Conducting Difference-in-Differences Analysis with Continuous Treatment Variables

difference-in-difference, econometrics, interpretation, stata, treatment-effect

I am currently researching the labour market effects of Venezuelan migration in Peru. As a first step, I want to estimate the effect of the recent mass migration on natives' mean wages in the three most populous cities. To do this, I computed the yearly mean wages by city from a large dataset of yearly labour market surveys (cross-sectional data from 2014 to 2019), along with the yearly migrant share of each city's population. The migrant share starts in 2017, which means that the treatment variable is 0 before 2017 and increases every year from 2017 onward, with a different intensity in each city. Before 2017, when the mass migration started, mean wages followed parallel trends in the three cities, which also share cultural demographics.

So I tried this code:

didregress (cities_wmean) (legshare_cities, continuous), group(cities) time(year)

cities_wmean: the city's mean wage. The value is the same for each respondent within each city, due to previous coding.

legshare_cities: the legal migration share, which is a proxy for the real migration. This variable goes from 0 to 1 because Stata does not accept a percentage variable. I would like to know if there is a different way to create a percentage variable.

cities: categorical variable that groups the survey respondents by city.

On the first try, I did not set the legal migrant share to 0 for the pre-treatment periods, and the regression $p$-value indicated a statistically significant treatment coefficient. This did not happen once I set the legal migrant share to 0 for the pre-treatment periods. The following graphs show this:

didregress (cities_wmean_n) (legshare_cities, continuous), group(cities) time(year) aeq

[Three images omitted]

I would like to know whether there is something wrong with this difference-in-differences setup, what the treatment coefficient would mean if the setup is correct, and whether there are any other suggestions.

Best Answer

Your approach seems valid.

From my quick reading of the Stata manual, -areg- maps well to -didregress-. You're assessing the effect of the Venezuelan migration share on the mean wage of Peruvians at the city level. From my review of the code, you're estimating the following:

$$ \ln(wage_{ict}) = \gamma_c + \lambda_t + \delta MS_{ct} + \epsilon_{ict}, $$

where you observe the mean (logged) wage of survey respondent $i$ within city $c$ and in year $t$. The parameters $\gamma_c$ and $\lambda_t$ denote fixed effects for cities and years, respectively. The treatment variable $MS_{ct}$ is the Venezuelan migration share, which is expressed as a proportion. Since the share varies across cities and is increasing over time in the post-treatment years, it is both $c$- and $t$-subscripted. To be clear, $MS_{ct}$ is 0 in the pre-migration years, then takes on positive values in 2017 onward. Note that this is simply the generalized difference-in-differences equation with a continuous treatment variable. Instead of a discretized version of treatment (i.e., 0/1 for pre-/post-treatment), the variable now represents the different gradations of exposure to migrants across cities and over time.
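A minimal sketch of the fixed-effects regression that corresponds to the equation above, assuming the variable names from the question and strictly positive wages. The name ln_wmean is introduced here for illustration only. As far as I can tell, this should reproduce the point estimate on the treatment variable, but I have not checked it against -didregress- output.

```stata
* Assumption: cities_wmean, legshare_cities, cities, and year are the
* variables described in the question; ln_wmean is a hypothetical name.

* Log of the city-level mean wage (missing if the wage is not positive).
generate double ln_wmean = ln(cities_wmean)

* City fixed effects are absorbed; year fixed effects enter as dummies.
* Standard errors are clustered by city.
areg ln_wmean legshare_cities i.year, absorb(cities) vce(cluster cities)
```

The coefficient on legshare_cities plays the role of $\delta$. Note that the outcome here is the log of the mean wage, which is not the same thing as the mean of logged wages.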

I would like to know if there is a different way to create a percentage variable.

First, I recommend working in log wages, which should ease the interpretation of $\delta$. Second, try leaving the variable as is. But be careful, as you're working with values bounded between 0 and 1. Say you observe a coefficient of .52 on the exposure variable. You should think about what a meaningful interpretation might be in practice. If you increase the share of migrants by .01, or 1 percentage point, then this increases average log wages by about .0052, which is approximately a .5 percent increase in wages.

Please note that I multiplied the coefficient by .01 before assessing it in percentage terms. Again, the immigrant share is expressed as a proportion. The traditional interpretation of a "one-unit" increase (here, from 0 to 1, or 100 percentage points) has no well-defined meaning in this context. You didn't enter the exposure variable into the model as a series of values from 0–100, which would make statements about "percentage point increases" a bit more palatable. To overcome this, multiply the coefficient by .01 first before giving it the useful percentage interpretation.

Now say you're more interested in the effect of migration on wages in 10-percentage-point jumps. Well, if you increase the share of migrants by .1, or 10 percentage points, then we should expect mean log wages to increase by .052, which is approximately 5 percent. The latter interpretation sounds nice, but you should decide which is more meaningful in the context of overall migration patterns.
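As a rough check on the arithmetic above, take the hypothetical coefficient of .52 and a log-wage outcome. The implied changes in log wages, along with their exact percentage equivalents, are:

$$
\begin{aligned}
\Delta MS_{ct} = .01: &\quad \delta \, \Delta MS_{ct} = .52 \times .01 = .0052, &\quad 100\,(e^{.0052} - 1) \approx .52\% \\
\Delta MS_{ct} = .1: &\quad \delta \, \Delta MS_{ct} = .52 \times .1 = .052, &\quad 100\,(e^{.052} - 1) \approx 5.3\%
\end{aligned}
$$

The log approximation is very close for the small change and drifts slightly for the larger one.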

On the first try I did not set the values to 0 in the legal migrant share variable for the pre-treatment periods, so the regression $p$-value indicated a statistically significant effect of the treatment coefficient, which did not happen when I set the legal migrant share for the pre-treatment time.

If the values of the exposure variable were not 0 before the first wave of Venezuelans arrived, then what were they? Was there a constant share of non-natives before 2017? Did you exclude all pre-treatment survey waves, which would explain why you're missing nearly half the observations in the first table? I'm not sure I can address all of these concerns without more information, but what I can say is that it's permissible to have zero values for the exposure variable in the pre-policy period. It seems reasonable, at least to me, that the share of migrants would be 0 before the actual refugee crisis. To acknowledge the comment by @mdewey, I just hope you're not doing this due to a lack of reliable migration data before the exposure.
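One way to answer these questions from the data is to inspect the exposure variable by year before running any regression. This is only a sketch, assuming the variable names from the question and 2017 as the first exposure year:

```stata
* Assumption: legshare_cities and year are the variables from the question.

* How many observations have a missing share, before and after 2017?
count if missing(legshare_cities) & year < 2017
count if missing(legshare_cities) & year >= 2017

* Number of nonmissing observations, mean, min, and max of the share by year.
tabstat legshare_cities, by(year) statistics(n mean min max)

* Stops with an error if any pre-2017 observation has a nonzero or
* missing share.
assert legshare_cities == 0 if year < 2017
```

If the pre-2017 values turn out to be missing rather than 0, those observations would be dropped from the estimation sample, which is one possible reason for a smaller sample in the first run.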

And lastly, you should be working with repeated cross-sections before and after the migration crisis. You should not be manipulating the pre-exposure migration share to achieve a strong, desirable, or "cool" result. The value of the exposure variable should match the reality of the actual observed legal migration share in that particular city and year.
