Solved – Testing for difference between incidence rates in R, spatial rdd

proportion; r; regression-discontinuity; spatial; t-test

I am an economics student just starting out with R, and while I'm beginning to be somewhat comfortable with it, I also realize I need to strongly brush up my basic statistics.

I have individual-level administrative data (in the form of a rotating panel) documenting a reform that was implemented as a pilot project in some regions but not in others, and I want to analyze the effects of the reform in a spatial regression discontinuity design. I have already extended the data with distances and travel times, which I acquired via the Google Maps API.

To get a first impression, I have compiled incidence rates in treated and non-treated regions for different time frames and for different distances to the nearest non-treated or treated location, respectively.

This is an excerpt of the results table:

     sample base  ref treated distance  all    recip         irr
1    pp.1yr 2003 2004       0    1e+04  602        5 0.008305648
2    pp.1yr 2003 2004       0    3e+04 6357       39 0.006134969
3    pp.1yr 2003 2004       0    5e+04 8528       57 0.006683865
4    pp.1yr 2003 2004       0    1e+05 9272       62 0.006686799
5    pp.1yr 2003 2004       1    1e+04  435        4 0.009195402
6    pp.1yr 2003 2004       1    3e+04 2438       16 0.006562756
7    pp.1yr 2003 2004       1    5e+04 3456       22 0.006365741
8    pp.1yr 2003 2004       1    1e+05 6360       45 0.007075472
9    pp.2yr 2002 2004       0    1e+04  245        2 0.008163265
10   pp.2yr 2002 2004       0    3e+04 2693       25 0.009283327
11   pp.2yr 2002 2004       0    5e+04 3699       36 0.009732360
12   pp.2yr 2002 2004       0    1e+05 4084       37 0.009059745
13   pp.2yr 2002 2004       1    1e+04  187        1 0.005347594
14   pp.2yr 2002 2004       1    3e+04  983       11 0.011190234
15   pp.2yr 2002 2004       1    5e+04 1400       16 0.011428571
16   pp.2yr 2002 2004       1    1e+05 2660       35 0.013157895

(I can also reshape everything to wide.)

My primary question is simply practical: How do I test in R whether the corresponding rates for treated/non-treated differ significantly, i.e. whether rates in treated regions are significantly smaller than in untreated regions? I was looking at t.test(), but I guess prop.test() is appropriate here.
From eyeballing the table, there does not seem to be any effect in the direction I had expected.

Any further info on how to proceed and hints at problems I might encounter or apparent misunderstandings would be nice.

One thing I thought of is that the number of observations is quite small. The main reason, besides the rather small target group, is the anonymisation of locations in the data, which leaves me with a lot of useless observations because I cannot locate them. There is a distant possibility of this restriction being relaxed so that I can use more data to increase the sample size (especially in border regions), but I am rather sceptical.

I also remembered to look at pre-/post- differences before treatment and after implementation nationwide.

Any help is greatly appreciated.

Best Answer

With your two binary variables (treated versus untreated, and developed the disease versus did not develop it), an odds ratio would appear to be one way forward. Since you had the data to calculate the incidence rates, I assume you can go back to the raw data and count cases versus non-cases. You would do this analysis instead of the t-test you mentioned, and it addresses your main question.
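As a rough illustration of the cases versus non-cases comparison, here is a minimal sketch in base R using one pair of rows from the table in the question (sample pp.1yr, distance 3e+04). It assumes that `recip` is the number of cases and `all` is the number of individuals at risk; the object names (`events`, `totals`, `tab`, `or_hat`, `ft`, `pt`) are made up for the example. It is not a full analysis, and it ignores the panel structure and the spatial design entirely.

```r
# Counts copied from the question's table (rows 2 and 6):
# "recip" is assumed to be the case count, "all" the number at risk.
events <- c(treated = 16, untreated = 39)
totals <- c(treated = 2438, untreated = 6357)

# 2 x 2 table of cases versus non-cases by treatment status
tab <- rbind(
  treated   = c(case = 16, noncase = 2438 - 16),
  untreated = c(case = 39, noncase = 6357 - 39)
)

# Sample odds ratio: odds of being a case in treated regions
# divided by the odds in untreated regions
or_hat <- (tab["treated", "case"] * tab["untreated", "noncase"]) /
  (tab["treated", "noncase"] * tab["untreated", "case"])

# Fisher's exact test; alternative = "less" tests whether the
# odds ratio (treated versus untreated) is below 1
ft <- fisher.test(tab, alternative = "less")

# One-sided two-sample test of proportions, as considered in the
# question: is the treated proportion smaller than the untreated one?
pt <- prop.test(x = events, n = totals, alternative = "less")

print(or_hat)
print(ft)
print(pt)
```

For this pair of rows the sample odds ratio is about 1.07, i.e. slightly above 1, which matches the impression from eyeballing the table that the rates do not differ in the expected direction. The same calls can be repeated for the other distance bands and time frames, keeping in mind that running many such tests raises a multiple-comparisons issue.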

This does not answer the part of your question about spatial regression discontinuity, which I am not familiar with. Hopefully someone else can reply on that aspect of your research.

Missing location data will likely bias all the analyses you wish to do, as I assume that missing locations also affect your ability to categorise observations into treated and untreated regions. Missing data are often not missing at random, which is why they cause bias. What percentage of your data is missing?