Solved – Fuzzy regression discontinuity design in Stata

instrumental-variables regression-discontinuity stata

I am currently estimating a "fuzzy" regression discontinuity design (RDD). Suppose my data take the following form:

$Z$: assignment variable; if $Z > Z_0$ then the person is assigned to the treatment with a certain probability $p_D$ (since we are in the "fuzzy" RDD framework, $p_D<1$).
$D$: treatment status; $D=1$ if the person is treated, 0 otherwise.
$X$: set of exogenous variables.
$Y$: Binary outcome variable.

To my knowledge (see e.g. [1]), running a fuzzy RDD is equivalent to applying instrumental variables using $Z$ as the instrument (hence in the first stage $D$ is regressed on $Z$ and $X$).

In order to estimate the model through Stata I used the following code:

biprobit (Y = X D) (D = X Z)

According to some research I have done (see Nichols' pdf at [2]), the -biprobit- command should be required because the endogenous variable ($D$) is binary.

Is the code above correct? Is it also possible to use a simple linear probability model like this?

ivregress 2sls Y X (D=Z)

Thanks for any help,

Stefano

[1] Angrist, J. D., Pischke, J. (2008). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

[2] http://www.stata.com/meeting/chicago11/materials/chi11_nichols.pdf

Best Answer

This is a partial answer. I think you should probably use both the biprobit and the ivreg/ivreg2 commands to check how robust your effects are. I like the biprobit approach given your data, but it does make some strong assumptions (no heteroskedasticity, no heterogeneous effects, normality of errors).* However, there is also a dedicated RD command in Stata called rdrobust. It can handle the fuzzy design and may be installed with:

net install rdrobust, from(http://www-personal.umich.edu/~cattaneo/rdrobust) replace

You can find an intro to the command in Cattaneo, Calonico, and Titiunik's Stata Journal paper Robust Data-Driven Inference in the Regression-Discontinuity Design.
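As a minimal sketch of what a fuzzy call might look like, the block below simulates data with the question's variable names and passes the treatment-status variable through the fuzzy() option. The data-generating process, the cutoff of 0, and all numeric values are assumptions made purely for illustration; they do not come from the question or from the paper.

```stata
* Simulated data for illustration only; every number here is an assumption.
clear
set seed 2013
set obs 2000
gen Z = runiform(-1, 1)                          // assignment variable, hypothetical cutoff Z0 = 0
gen byte above = (Z > 0)                         // 1 if on the treated side of the cutoff
gen byte D = (runiform() < 0.2 + 0.6*above)      // take-up probability jumps at the cutoff
gen byte Y = (rnormal() < -0.3 + 0.8*D + 0.5*Z)  // binary outcome

* Fuzzy RD: outcome, running variable, cutoff c(), treatment status in fuzzy()
rdrobust Y Z, c(0) fuzzy(D)
```

With real data, only the last line is needed, with c() set to the actual cutoff $Z_0$.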

*Austin Nichols' simulation results indicate that the marginal effects may be less sensitive than the latent index function parameters to violations of the biprobit assumptions. The LPM is also not always the model of steel that Angrist and Pischke make it out to be.

Potential outcomes framework

We can use the familiar potential outcomes model to unpack these specifications. For simplicity of exposition, we exclude all exogenous variables other than the forcing variable $X_i$, which determines the treatment assignment ($D_i=1$) deterministically in the sharp RD case and stochastically in the fuzzy RD (FRD) case. The conditional mean of the outcome in terms of the observable variables is given by

$$ \begin{align} \mathbb{E}(Y_i \mid X_i, D_i) &= \mathbb{E}(Y_{0i}\mid X_i, D_i) + D_i\left(\mathbb{E}(Y_{1i}\mid X_i, D_i)-\mathbb{E}(Y_{0i}\mid X_i, D_i)\right) \end{align} $$

Here we make no parametric assumptions about the form of the conditional expectation functions. Note that all of these specifications are restricted to the locality of $x_0$, that is, $X_i\in [x_0-\Delta_n, x_0+\Delta_n]$, where the indexing by the sample size is for pragmatic reasons (it becomes relevant when we define the estimator).

Recall that in the sharp RD case, we can write $D_i=\mathbf{1}_{[X_i\geq x_0]}$, where $x_0$ is the point of discontinuity. In the FRD case, this relationship is no longer deterministic. Instead, the conditional mean of $D_i$ is modelled in terms of the discontinuity:

$$ \begin{align} \mathbb{E}(D_i\mid X_i) &= \mathbb{P}\left[D_i=1\mid X_i\right]\\ &=(1-\mathbf{1}_{[X_i\geq x_0]})\mathbb{P}\left[D_i=1\mid X_i< x_0\right] + \mathbf{1}_{[X_i\geq x_0]}\mathbb{P}\left[D_i=1\mid X_i\geq x_0\right] \end{align} $$

Note that since $X_i$ is exogenous in the system, so is the random variable $\mathbf{1}_{[X_i\geq x_0]}$. It acts as the excluded exogenous variable in the specification of the conditional mean of the endogenous variable $D_i$.

Estimation

This is then a valid just-identified IV model, with one endogenous variable, $D_i$, and one excluded exogenous variable, $\mathbf{1}_{[X_i\geq x_0]}$. A direct and general estimator with no further parametric assumptions is the nonparametric Wald estimator:

$$ \dfrac{\widehat{\mathbb{E}}\left(Y_i \mid x_0 \leq X_i\leq x_0+ \Delta_n \right)-\widehat{\mathbb{E}}\left(Y_i \mid x_0- \Delta_n \leq X_i< x_0\right)}{\widehat{\mathbb{P}}\left[D_i=1\mid x_0 \leq X_i\leq x_0+ \Delta_n \right]-\widehat{\mathbb{P}}\left[D_i=1\mid x_0- \Delta_n \leq X_i< x_0\right]} $$

Typically, local smoothers such as the local linear smoother are used to estimate the conditional mean functions.
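To make the Wald ratio concrete, here is a rough sketch that computes it from plain window means (not local linear fits) and then reproduces the same number with just-identified 2SLS, using the indicator as the excluded instrument. The simulated data, the variable names (x, above, d, y), the cutoff of 0, and the window half-width h are all assumptions chosen for illustration.

```stata
* Simulated data for illustration only; every number here is an assumption.
clear
set seed 12345
set obs 5000
gen x = runiform(-1, 1)                      // forcing variable, hypothetical cutoff x0 = 0
gen byte above = (x >= 0)                    // excluded instrument 1[X >= x0]
gen byte d = (runiform() < 0.2 + 0.6*above)  // fuzzy treatment take-up
gen y = 1 + 2*d + 0.5*x + rnormal()

* Restrict to the window [x0 - h, x0 + h]
local h = 0.1
keep if abs(x) <= `h'

* Numerator and denominator: differences in window means across the cutoff
summarize y if above, meanonly
local y1 = r(mean)
summarize y if !above, meanonly
local y0 = r(mean)
summarize d if above, meanonly
local d1 = r(mean)
summarize d if !above, meanonly
local d0 = r(mean)
display "Wald estimate: " (`y1' - `y0') / (`d1' - `d0')

* The same point estimate from just-identified 2SLS
ivregress 2sls y (d = above), vce(robust)
```

With a single binary instrument and no other regressors, the 2SLS coefficient on d equals the ratio of mean differences, which is why the two numbers agree. Replacing the window means with local linear fits on each side changes the estimates of the four conditional means but not the structure of the ratio.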

ATE interpretation

Note that in order to interpret the given estimator as the average treatment effect (ATE) in the locality of $x_0$, we have used the implausible but routine conditional (on $X_i$) independence of $D_i$ and $Y_{1i}-Y_{0i}$. This allows us to remove the conditioning on $D_i$ in the conditional mean function of the outcome in a mathematically convenient way. For more details, see Hahn, Todd & van der Klaauw (2001), which is an excellent and readable reference for RD models. They also provide interpretations of the parameter being estimated under weaker assumptions.
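A sketch of the step behind this interpretation may help. The shorthand $\tau$, $Y^{\pm}$ and $D^{\pm}$ is introduced here for illustration, and the argument additionally assumes that $\mathbb{E}(Y_{0i}\mid X_i=x)$ and $\mathbb{E}(Y_{1i}-Y_{0i}\mid X_i=x)$ are continuous in $x$ at $x_0$. Writing $Y_i = Y_{0i} + D_i(Y_{1i}-Y_{0i})$ and using the conditional independence of $D_i$ and $Y_{1i}-Y_{0i}$ given $X_i$,

$$ \mathbb{E}(Y_i\mid X_i=x) = \mathbb{E}(Y_{0i}\mid X_i=x) + \mathbb{E}(D_i\mid X_i=x)\,\mathbb{E}(Y_{1i}-Y_{0i}\mid X_i=x) $$

Let $Y^{+}$ and $Y^{-}$ denote the right and left limits of $\mathbb{E}(Y_i\mid X_i=x)$ at $x_0$, let $D^{+}$ and $D^{-}$ denote the corresponding limits of $\mathbb{E}(D_i\mid X_i=x)$, and let $\tau = \mathbb{E}(Y_{1i}-Y_{0i}\mid X_i=x_0)$. Taking both limits and subtracting, the $Y_{0i}$ term cancels by continuity, which leaves

$$ Y^{+}-Y^{-} = \left(D^{+}-D^{-}\right)\tau \quad\Longrightarrow\quad \tau = \dfrac{Y^{+}-Y^{-}}{D^{+}-D^{-}}, \qquad D^{+}\neq D^{-} $$

The Wald estimator is the sample analogue of this ratio over the window of width $\Delta_n$ on each side of $x_0$.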

Solved – Graphs in regression discontinuity design in “Stata” or “R”

Is this much different from doing two local polynomials of degree 2, one for below the threshold and one for above with smooth at $K_i$ points? Here's an example with Stata:

use votex // the election-spending data that comes with rd

tw ///
    (scatter lne d, mcolor(gs10) msize(tiny)) ///
    (lpolyci lne d if d<0, bw(0.05) deg(2) n(100) fcolor(none)) ///
    (lpolyci lne d if d>=0, bw(0.05) deg(2) n(100) fcolor(none)), xline(0) legend(off)

Alternatively, you can just save the lpoly smoothed values and standard errors as variables instead of using twoway. Below $x$ is the bin, $s$ is the smoothed mean, $se$ is the standard error, and $ul$ and $ll$ are the upper and lower limits of the 95% Confidence Interval for the smoothed outcome.

lpoly lne d if d<0, bw(0.05) deg(2) n(100) gen(x0 s0) ci se(se0)
lpoly lne d if d>=0, bw(0.05) deg(2) n(100) gen(x1 s1) ci se(se1)

/* Get the 95% CIs */
forvalues v = 0/1 {
    gen ul`v' = s`v' + 1.96*se`v'
    gen ll`v' = s`v' - 1.96*se`v'
}

tw ///
    (line ul0 ll0 s0 x0, lcolor(blue blue blue) lpattern(dash dash solid)) ///
    (line ul1 ll1 s1 x1, lcolor(red red red) lpattern(dash dash solid)), legend(off)

As you can see, the lines in the first plot are the same as in the second.
