Solved – Bins in Regression Discontinuity Designs

binning, multiple regression, regression-discontinuity

Lee and Lemieux (2009, p. 31) suggest that researchers also present graphs when carrying out a regression discontinuity design (RDD) analysis. They propose the following procedure:

"…for some bandwidth $h$, and for some number of bins $K_0$ and
$K_1$ to the left and right of the cutoff value, respectively, the
idea is to construct bins $(b_k, b_{k+1}]$, for $k = 1, \ldots, K =
K_0 + K_1$, where $b_k = c - (K_0 - k + 1) \cdot h$."

$c$ = cutoff point, i.e. the threshold value of the assignment variable
$h$ = bin width

They then calculate mean values of the outcome within bins and compare the mean outcomes just to the left and right of the cutoff point.
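The construction quoted above can be sketched in a few lines. This is a minimal illustration in plain Python, not code from Lee and Lemieux; the function name `fixed_width_bin_means` and the toy data are hypothetical. It follows the half-open intervals $(b_k, b_{k+1}]$ literally, so an observation exactly at the cutoff lands in the last bin on the left.

```python
from statistics import mean


def fixed_width_bin_means(x, y, c, h, k0, k1):
    """Mean of y within bins (b_k, b_{k+1}], where b_k = c - (k0 - k + 1) * h."""
    n_bins = k0 + k1
    # Edges b_1, ..., b_{K+1}; the edge with index k0 + 1 equals the cutoff c.
    edges = [c - (k0 - k + 1) * h for k in range(1, n_bins + 2)]
    results = []
    for left, right in zip(edges[:-1], edges[1:]):
        ys = [yi for xi, yi in zip(x, y) if left < xi <= right]
        # Empty bins are reported with a mean of None rather than dropped.
        results.append((left, right, len(ys), mean(ys) if ys else None))
    return results


# Toy data (hypothetical): cutoff at 0, bin width 1, two bins on each side.
x = [-1.5, -0.5, -0.2, 0.3, 0.8, 1.5]
y = [1, 2, 4, 10, 12, 14]
for left, right, n, ybar in fixed_width_bin_means(x, y, c=0, h=1, k0=2, k1=2):
    print(f"({left}, {right}]: n={n}, mean={ybar}")
```

Comparing the mean of the bin ending at the cutoff with the mean of the bin starting at it gives the visual "jump" the graph is meant to show. With fixed-width bins the count `n` can vary a lot from bin to bin, which is the source of the noise described in the question.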

My question is whether we should always use a fixed bin width $h$. Put differently, would it be legitimate to bin the data so that the number of observations is constant within each bin? The reason for my question is that some parts of my forcing variable are sparsely populated, resulting in a noisy graph.

Best Answer

You are making a very good point: in fact, there is a paper by Calonico, Cattaneo and Titiunik (2015a) that makes the same point and discusses binwidth selectors for quantile-spaced plots.
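To make the idea of quantile-spaced bins concrete, here is a minimal sketch in plain Python. It is not the data-driven selector from the paper: the number of bins on each side is simply supplied by the caller, the function name `quantile_bin_means` is hypothetical, and assigning observations exactly at the cutoff to the right-hand side is an assumption of this sketch.

```python
from statistics import mean


def quantile_bin_means(x, y, c, k0, k1):
    """Split each side of the cutoff into bins holding (nearly) equal counts.

    Returns two lists (left side, right side) of tuples
    (mean of x in bin, mean of y in bin, number of observations).
    """
    def side_bins(points, n_bins):
        points = sorted(points)  # order by the forcing variable
        n = len(points)
        out = []
        for j in range(n_bins):
            # Integer arithmetic spreads any remainder across the bins.
            chunk = points[j * n // n_bins:(j + 1) * n // n_bins]
            if chunk:
                xs = [p[0] for p in chunk]
                ys = [p[1] for p in chunk]
                out.append((mean(xs), mean(ys), len(chunk)))
        return out

    # Assumption: observations exactly at the cutoff go to the right side.
    left = [(xi, yi) for xi, yi in zip(x, y) if xi < c]
    right = [(xi, yi) for xi, yi in zip(x, y) if xi >= c]
    return side_bins(left, k0), side_bins(right, k1)


# Toy data (hypothetical): sparse in the tails, dense near the cutoff at 0.
x = [-9, -4, -3, -1, 1, 2, 3, 10]
y = [1, 2, 3, 4, 5, 6, 7, 8]
left_bins, right_bins = quantile_bin_means(x, y, c=0, k0=2, k1=2)
print(left_bins)
print(right_bins)
```

Each bin now contains the same number of observations, so the bin means have comparable precision, at the cost of bins that are wide where the forcing variable is sparse. Tied values of the forcing variable may be split across adjacent bins in this sketch.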

The paper is a little technical, so you might want to look instead at their R Journal paper; see Calonico, Cattaneo and Titiunik (2015b).

You might also want to have a look at the interactive RDD plot tool at http://shiny.qua.st/rddtools/, which should soon also allow changing the binwidth interactively, although it does not offer quantile-spaced plots for the moment.

Refs:

Calonico, S., M. D. Cattaneo, and R. Titiunik. (2015a) Optimal Data-Driven Regression Discontinuity Plots. Journal of the American Statistical Association 110(512): 1753-1769. http://www-personal.umich.edu/~cattaneo/papers/Calonico-Cattaneo-Titiunik_2015_JASA.pdf
Calonico, S., M. D. Cattaneo, and R. Titiunik. (2015b) rdrobust: An R Package for Robust Nonparametric Inference in Regression-Discontinuity Designs. R Journal 7(1): 38-51. http://www-personal.umich.edu/~cattaneo/papers/Calonico-Cattaneo-Titiunik_2015_R.pdf

Related Solutions

Solved – Regression discontinuity versus matching with spatial discontinuity

I would say matching is more appropriate (and easier), but the logic of each approach has some comparable aspects worth expounding upon.

Regression discontinuity designs are predicated on there being some observable relationship between a variable, $X$, and the outcome, $Y$. In RDD, some other exogenous intervention then takes effect at a threshold of $X$. Implicit in the design is that cases are comparable on each side of the threshold (that is, no other differences between the cases exist on each side of the threshold), so any discontinuity in the relationship between $X$ and $Y$ at that threshold can be considered the treatment effect.
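As a rough illustration of that logic, the sketch below fits a separate straight line on each side of the threshold within a window of width `h` and takes the difference of the two fitted values at the threshold. It is a deliberately simplified version (uniform weights, no standard errors, no bandwidth selection), not the estimator of any particular package; the names `ols_line` and `rd_jump` and the toy data are hypothetical.

```python
def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    slope = sxy / sxx  # needs at least two distinct x values
    return my - slope * mx, slope


def rd_jump(x, y, c, h):
    """Jump at the cutoff c: right-side fit minus left-side fit, both at c.

    Assumption of this sketch: units with x >= c are on the treated side.
    """
    # Centre x at the cutoff so each intercept is the fitted value at c.
    left = [(xi - c, yi) for xi, yi in zip(x, y) if c - h <= xi < c]
    right = [(xi - c, yi) for xi, yi in zip(x, y) if c <= xi <= c + h]
    a_left, _ = ols_line(*zip(*left))
    a_right, _ = ols_line(*zip(*right))
    return a_right - a_left


# Toy data (hypothetical): y = 1 + 2x below the cutoff, y = 4 + 2x above it.
x = [-2, -1, -0.5, 0, 0.5, 1, 2]
y = [1 + 2 * xi if xi < 0 else 4 + 2 * xi for xi in x]
print(rd_jump(x, y, c=0, h=2))
```

The estimate is only interpretable as a treatment effect under the comparability condition stated above; the code itself cannot check that condition.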

One thing that makes RDD difficult here is that it is unclear what $X$ is in your case (you could think of many candidates in addition to the distance measure you mentioned), and for the social science variables listed it is unlikely that $X$ has a clear, strong relationship to $Y$. I would also be skeptical that cases on either side of the threshold are entirely comparable, so one would want to include other socio-demographic indicators. This can be done, but it makes such a quasi-experiment markedly less appealing.

Thus I would suggest matching or estimating propensity score models. You can certainly find a history of examples of matching across a border (see for instance Card & Krueger, 1994). I have also seen matching of spatial units extended to propensity score models; for instance, Ridgeway (2006) uses a flexible set of generalized boosted models to estimate propensity scores for post-traffic-stop outcomes (e.g. searches, arrests). Such flexible models are attractive because spatial trends can be hard to characterize with social science data and may take many parameters to model effectively. Such models can also readily include the other socio-demographic covariates one would expect to include in such research designs (at least for the outcomes you mention).

Citations

Card, David & Alan Krueger. 1994. Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania. The American Economic Review 84(4): 772-793.
Ridgeway, Greg. 2006. Assessing the effect of race bias in post-traffic stop outcomes using propensity scores. Journal of Quantitative Criminology 22(1): 1-29.

Solved – Graphs in regression discontinuity design in “Stata” or “R”

Is this much different from fitting two local polynomials of degree 2, one below the threshold and one above it, with smoothing at $K_i$ points? Here's an example with Stata:

use votex // the election-spending data that comes with rd

* Scatter plus separate local quadratic fits (with CIs) on each side of the cutoff
tw ///
    (scatter lne d, mcolor(gs10) msize(tiny)) ///
    (lpolyci lne d if d<0, bw(0.05) deg(2) n(100) fcolor(none)) ///
    (lpolyci lne d if d>=0, bw(0.05) deg(2) n(100) fcolor(none)), xline(0) legend(off)

Alternatively, you can save the lpoly smoothed values and standard errors as variables instead of using twoway. Below, $x$ is the bin (the smoothing grid point), $s$ is the smoothed mean, $se$ is the standard error, and $ul$ and $ll$ are the upper and lower limits of the 95% confidence interval for the smoothed outcome.

lpoly lne d if d<0, bw(0.05) deg(2) n(100) gen(x0 s0) ci se(se0)
lpoly lne d if d>=0, bw(0.05) deg(2) n(100) gen(x1 s1) ci se(se1)

/* Get the 95% CIs (1.96 is the normal critical value for a two-sided 95% interval) */
forvalues v=0/1 {
    gen ul`v' = s`v' + 1.96*se`v'
    gen ll`v' = s`v' - 1.96*se`v'
}

* Plot the smoothed means (solid) with their CI limits (dashed) on each side
tw ///
    (line ul0 ll0 s0 x0, lcolor(blue blue blue) lpattern(dash dash solid)) ///
    (line ul1 ll1 s1 x1, lcolor(red red red) lpattern(dash dash solid)), legend(off)

As you can see, the lines in the first plot are the same as in the second.
