Solved – Propensity score matching with large data

Tags: large-data, r, sas

I have a large healthcare claims database with 1.6 million subjects, and I'm interested in doing a cohort study with propensity score matching. I have produced my propensity score with a logistic model. The problem is that I have about 260,000 exposed subjects to match, ideally in a 1:3 ratio, to the rest of the sample.
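For concreteness, this is a minimal sketch of that propensity-model step; the data frame `claims`, the exposure flag `exposed`, and the covariates are all placeholder names, not details from the original setup:

```r
# Fit the propensity model P(exposed = 1 | covariates) by logistic regression.
# All variable names here are illustrative placeholders.
ps_model <- glm(exposed ~ age + sex + comorbidity_index,
                data = claims, family = binomial())
claims$ps <- predict(ps_model, type = "response")  # per-subject propensity score
```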

I have tried MatchIt in R, subdividing my sample into zip-code-level areas (essentially exact matching on zip code, then nearest-propensity-score matching within each area; see the sketch below). This is fast because MatchIt handles many small datasets easily, but the final matched dataset isn't as well balanced as it should be given how many potential controls there are.
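As an aside, that per-zip-code subdivision can be expressed in a single MatchIt call via the `exact` argument. A hedged sketch, with placeholder column names; `exact` takes a formula in MatchIt >= 4.0 (a character vector of variable names in older versions):

```r
library(MatchIt)

# Exact matching on zip code, then nearest-neighbour matching on the
# propensity score within each zip code. Column names are placeholders.
m_zip <- matchit(exposed ~ age + sex + comorbidity_index,
                 data   = claims,
                 method = "nearest",
                 exact  = ~ zip,
                 ratio  = 3)    # 1:3 exposed-to-control matching
```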

MatchIt basically crashes when I try to match more than 30,000 or so subjects at a time.* I tried SAS on our department's fast UNIX server using this macro, but it also either crashed or ran for many, many hours. I think there must be a better way, given that my dataset isn't really that huge.

So, my question is: how would you do 1:3 matching with a dataset of this size? It doesn't need to be blazing fast; I just want to be confident that I'll get reliable output after a few hours.

*BIG caveat: for any R solution that may need hours to run, I'm restricted to 32-bit R on my office server, which is a big bummer.

Best Answer

Have you tried nearest-neighbour matching in MatchIt (method = "nearest")? As a "greedy" algorithm, it should be fast even for larger sample sizes. If for some reason that does not work, you could program the nearest-neighbour search yourself, allowing three matches before an observation in the treatment group is "used up". Obviously the matching will be somewhat suboptimal, but it can be a sensible compromise when the dataset is too large for "optimal" matching.
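A minimal sketch of the MatchIt suggestion, assuming MatchIt >= 4.0 (which accepts a numeric vector of precomputed scores as `distance`) and a data frame `claims` with exposure flag `exposed` and score `ps`; all names are placeholders. `ratio = 3` requests the 1:3 matching from the question:

```r
library(MatchIt)

# Greedy 1:3 nearest-neighbour matching on a precomputed propensity score.
m <- matchit(exposed ~ age + sex + comorbidity_index,
             data     = claims,
             method   = "nearest",  # greedy nearest-neighbour matching
             distance = claims$ps,  # reuse the already-fitted score
             ratio    = 3)          # three controls per exposed subject
summary(m)                # covariate balance diagnostics
matched <- match.data(m)  # matched sample for the outcome analysis
```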
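And if MatchIt itself won't cooperate, a hand-rolled greedy matcher is only a few lines. The sketch below is illustrative, not optimised: it scans all controls for every treated subject, so it is roughly quadratic in the sample size; at 1.6 million rows you would want a sorted search (e.g. via findInterval) instead.

```r
# Hand-rolled greedy 1:k nearest-neighbour matching on the propensity score.
# `ps` is the score vector and `treated` a logical exposure flag (placeholder
# names). Each control can be used at most once.
greedy_match <- function(ps, treated, ratio = 3) {
  t_idx <- which(treated)
  c_idx <- which(!treated)
  t_idx <- t_idx[order(ps[t_idx], decreasing = TRUE)]  # match high scores first
  c_ps      <- ps[c_idx]
  available <- rep(TRUE, length(c_idx))
  matches   <- vector("list", length(t_idx))
  for (i in seq_along(t_idx)) {
    d <- abs(c_ps - ps[t_idx[i]])
    d[!available] <- Inf                  # skip controls already used up
    pick <- order(d)[seq_len(ratio)]      # indices of the `ratio` nearest
    pick <- pick[is.finite(d[pick])]      # controls may run out near the end
    available[pick] <- FALSE
    matches[[i]] <- c_idx[pick]
  }
  data.frame(treated = rep(t_idx, lengths(matches)),
             control = unlist(matches))
}
```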
