I am not a fan of the Angrist and Pischke book, but they do have a flair for phrasing, and as they say, fuzzy RD is IV (Sec. 6.2). What obscures this is that the instrument is essentially a nonlinear transformation (a step function) of one of the included exogenous variables, which is a valid instrument by virtue of the conditional exogeneity assumption.
Assume that each subject is characterized by the tuple of random variables, $\{Y_{0i}, Y_{1i}, D_i, X_i\}$, where $Y_{0i}$ and $Y_{1i}$ are the potential outcomes under non-treatment and treatment respectively, $D_i$ is an indicator variable of whether treatment is administered (which governs which of the potential outcomes is observed for a subject), and $X_i$ is the so-called forcing variable which deterministically or stochastically determines treatment. Usually, the fuzzy RD [FRD] model is stated as the rather concise set of specifications
$$
\begin{align}
\lim_{x\downarrow x_0} \mathbb{E}(D_i\mid X_i = x) &\neq \lim_{x\uparrow x_0} \mathbb{E}(D_i\mid X_i = x)\\
\lim_{x\downarrow x_0} \mathbb{E}(Y_{0i}\mid X_i = x) &= \lim_{x\uparrow x_0} \mathbb{E}(Y_{0i}\mid X_i = x)\\
\end{align}
$$
which are intuitively transparent, but are hard to work with.
Potential outcomes framework
We can use the familiar potential outcomes model to unpack these specifications. For simplicity of exposition, we exclude all exogenous variables other than the forcing variable $X_i$, which deterministically (in the case of RDD) or stochastically (in the case of FRD) determines the treatment assignment ($D_i=1$). The conditional mean of the outcome in terms of the observable variables is given by
$$
\begin{align}
\mathbb{E}(Y_i \mid X_i, D_i) &= \mathbb{E}(Y_{0i}\mid X_i, D_i) + D_i\left(\mathbb{E}(Y_{1i}\mid X_i, D_i)-\mathbb{E}(Y_{0i}\mid X_i, D_i)\right) \\
\end{align}
$$
Here we make no parametric assumptions about the form of the conditional expectation functions. Note that all of these specifications are restricted to the locality of $x_0$, that is $X_i\in [x_0-\Delta_n, x_0+\Delta_n]$, where the indexing by the sample size is for pragmatic reasons (it becomes relevant when we define the estimator).
Recall that in the sharp RD case, we can write $D_i=\mathbf{1}_{[X_i\geq x_0]}$, where $x_0$ is the point of discontinuity. In the FRD case, this relationship is no longer deterministic; instead, the conditional mean of $D_i$ is modelled in terms of the discontinuity
$$
\begin{align}
\mathbb{E}(D_i\mid X_i) &= \mathbb{P}\left[D_i=1\mid X_i\right]\\
&=(1-\mathbf{1}_{[X_i\geq x_0]})\mathbb{P}\left[D_i=1\mid X_i< x_0\right] + \mathbf{1}_{[X_i\geq x_0]}\mathbb{P}\left[D_i=1\mid X_i\geq x_0\right]
\end{align}
$$
Note that since $X_i$ is exogenous in the system, so is the random variable $\mathbf{1}_{[X_i\geq x_0]}$; it acts as the excluded exogenous variable in the specification of the conditional mean of the endogenous variable $D_i$.
Estimation
This is then a valid just-identified IV model, with one endogenous variable $D_i$, and one excluded exogenous variable $\mathbf{1}_{[X_i\geq x_0]}$. A direct and general estimator with no further parametric assumptions is the nonparametric Wald estimator.
$$
\dfrac{\widehat{\mathbb{E}}\left(Y_i \mid x_0 \leq X_i\leq x_0+ \Delta_n \right)-\widehat{\mathbb{E}}\left(Y_i \mid x_0- \Delta_n \leq X_i< x_0\right)}{\widehat{\mathbb{P}}\left[D_i=1\mid x_0 \leq X_i\leq x_0+ \Delta_n \right]-\widehat{\mathbb{P}}\left[D_i=1\mid x_0- \Delta_n \leq X_i< x_0\right]}
$$
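To see why this ratio is an IV estimator, it may help to write $Z_i=\mathbf{1}_{[X_i\geq x_0]}$ and $p=\mathbb{P}[Z_i=1]$, with all moments taken within the window $[x_0-\Delta_n, x_0+\Delta_n]$. The symbols $Z_i$ and $p$ are shorthand introduced here for illustration. A sketch of the standard identity for a binary instrument:
$$
\begin{align}
\mathrm{Cov}(Y_i, Z_i) &= \mathbb{E}(Y_iZ_i)-\mathbb{E}(Y_i)\,\mathbb{E}(Z_i)\\
&= p\,\mathbb{E}(Y_i\mid Z_i=1) - p\left[p\,\mathbb{E}(Y_i\mid Z_i=1)+(1-p)\,\mathbb{E}(Y_i\mid Z_i=0)\right]\\
&= p(1-p)\left[\mathbb{E}(Y_i\mid Z_i=1)-\mathbb{E}(Y_i\mid Z_i=0)\right]
\end{align}
$$
The same algebra applies with $D_i$ in place of $Y_i$, so provided $0<p<1$ and the denominator is nonzero, the factor $p(1-p)$ cancels:
$$
\dfrac{\mathrm{Cov}(Y_i, Z_i)}{\mathrm{Cov}(D_i, Z_i)} = \dfrac{\mathbb{E}(Y_i\mid Z_i=1)-\mathbb{E}(Y_i\mid Z_i=0)}{\mathbb{E}(D_i\mid Z_i=1)-\mathbb{E}(D_i\mid Z_i=0)}
$$
The left-hand side is the population just-identified IV coefficient, and the right-hand side is the population counterpart of the Wald ratio.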
Typically, local smoothers such as the local linear smoother are used to estimate the conditional mean functions.
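As a rough illustration of the estimator in practice, here is a minimal Stata sketch. The variable names `y` (outcome), `d` (treatment dummy) and `x` (forcing variable), as well as the cutoff and half-width values, are hypothetical placeholders, and the local linear variant uses a uniform kernel, which is only one of several possible choices.

```stata
* Hypothetical variable names: y (outcome), d (treatment dummy),
* x (forcing variable). Cutoff x0 and half-width h are placeholders.
local x0 = 0
local h  = 0.05

* Excluded instrument: the step function of the forcing variable
generate byte z = (x >= `x0') if !missing(x)

* Wald estimator: 2SLS of y on d with z as the only instrument,
* restricted to the window [x0 - h, x0 + h]
ivregress 2sls y (d = z) if abs(x - `x0') <= `h', vce(robust)

* Local linear variant (uniform kernel): separate slopes in x
* on each side of the cutoff, within the same window
generate double xc  = x - `x0'
generate double xcz = xc * z
ivregress 2sls y xc xcz (d = z) if abs(xc) <= `h', vce(robust)
```

With a single binary instrument and no other covariates, the first `ivregress` call reproduces the ratio of differences in window means. The second call adds linear terms in the centred forcing variable so that the estimate is less sensitive to slope in the conditional means near the cutoff.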
ATE interpretation
Note that in order to interpret the given estimator as the average treatment effect [ATE] in the locality of $x_0$, we have used the implausible but routine conditional (on $X_i$) independence of $D_i$ and $Y_{1i}-Y_{0i}$. This allows us to remove the conditioning on $D_i$ in the conditional mean function of the outcome in a mathematically convenient way. For more details, see Hahn, Todd & van der Klaauw (2001), which is an excellent and readable reference for RD models. They also provide interpretations of the parameter being estimated under weaker assumptions.
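The step can be sketched as follows, under the stated conditional independence and one further assumption that is not spelled out above, namely that $\mathbb{E}(Y_{1i}-Y_{0i}\mid X_i=x)$ is continuous at $x_0$. Using $Y_i = Y_{0i}+D_i(Y_{1i}-Y_{0i})$,
$$
\begin{align}
\mathbb{E}(Y_i\mid X_i=x) &= \mathbb{E}(Y_{0i}\mid X_i=x)+\mathbb{E}\left(D_i(Y_{1i}-Y_{0i})\mid X_i=x\right)\\
&= \mathbb{E}(Y_{0i}\mid X_i=x)+\mathbb{E}(D_i\mid X_i=x)\,\mathbb{E}(Y_{1i}-Y_{0i}\mid X_i=x)
\end{align}
$$
Taking the difference between the right and left limits at $x_0$, the first term cancels by the continuity of $\mathbb{E}(Y_{0i}\mid X_i=x)$, leaving
$$
\lim_{x\downarrow x_0}\mathbb{E}(Y_i\mid X_i=x)-\lim_{x\uparrow x_0}\mathbb{E}(Y_i\mid X_i=x) = \mathbb{E}(Y_{1i}-Y_{0i}\mid X_i=x_0)\left[\lim_{x\downarrow x_0}\mathbb{E}(D_i\mid X_i=x)-\lim_{x\uparrow x_0}\mathbb{E}(D_i\mid X_i=x)\right]
$$
Dividing by the jump in the treatment probability, which is nonzero by the first FRD condition, gives $\mathbb{E}(Y_{1i}-Y_{0i}\mid X_i=x_0)$, the ATE at the cutoff.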
Is this much different from doing two local polynomials of degree 2, one for below the threshold and one for above with smooth at $K_i$ points? Here's an example with Stata:
use votex // the election-spending data that comes with rd
tw ///
(scatter lne d, mcolor(gs10) msize(tiny)) ///
(lpolyci lne d if d<0, bw(0.05) deg(2) n(100) fcolor(none)) ///
(lpolyci lne d if d>=0, bw(0.05) deg(2) n(100) fcolor(none)), xline(0) legend(off)
Alternatively, you can just save the lpoly smoothed values and standard errors as variables instead of using twoway. Below, $x$ is the bin, $s$ is the smoothed mean, $se$ is the standard error, and $ul$ and $ll$ are the upper and lower limits of the 95% confidence interval for the smoothed outcome.
lpoly lne d if d<0, bw(0.05) deg(2) n(100) gen(x0 s0) ci se(se0)
lpoly lne d if d>=0, bw(0.05) deg(2) n(100) gen(x1 s1) ci se(se1)
/* Get the 95% CIs */
forvalues v=0/1 {
gen ul`v' = s`v' + 1.96*se`v'
gen ll`v' = s`v' - 1.96*se`v'
}
tw ///
(line ul0 ll0 s0 x0, lcolor(blue blue blue) lpattern(dash dash solid)) ///
(line ul1 ll1 s1 x1, lcolor(red red red) lpattern(dash dash solid)), legend(off)
As you can see, the lines in the first plot are the same as in the second.
Best Answer
This is a partial answer. I think you should probably use both the biprobit and the ivreg/ivreg2 commands to check how robust your effects are. I like the biprobit approach given your data, but it does make some strong assumptions (no heteroskedasticity, no heterogeneous effects, normality of errors).* However, there's also a dedicated RD command in Stata called rdrobust. It can handle the fuzzy design; as a user-written command, it needs to be installed first. You can find an intro to the command in Cattaneo, Calonico, and Titiunik's Stata Journal paper "Robust Data-Driven Inference in the Regression-Discontinuity Design".
*Austin Nichols' simulation results indicate that the marginal effects may be less sensitive than the latent index function parameters to biprobit assumption violations. The LPM model is also not always the model of steel that A&P make it out to be.
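A minimal sketch of how these checks might be run side by side, assuming rdrobust is already installed. The variable names `y` (outcome), `d` (treatment received) and `x` (forcing variable), and the cutoff at 0, are hypothetical. The official `ivregress` command is used here as a stand-in for ivreg/ivreg2.

```stata
* Hypothetical variable names: y (outcome), d (treatment received),
* x (forcing variable, cutoff assumed to be at 0).

* Fuzzy RD with rdrobust (assumed to be installed already)
rdrobust y x, c(0) fuzzy(d)

* Excluded instrument for the IV-style checks: z = 1[x >= 0]
generate byte z = (x >= 0) if !missing(x)

* Linear IV (official ivregress used in place of ivreg/ivreg2)
ivregress 2sls y x (d = z), vce(robust)

* Recursive bivariate probit; only sensible if y is binary
biprobit (y = d x) (d = z x)
```

The three sets of estimates are not directly comparable in scale: rdrobust and the linear IV report effects on the outcome itself, while biprobit reports latent index coefficients, so marginal effects would need to be computed before comparing them.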