R – What Are the Possible Solutions for Matching on Very Large Datasets?

[matching] [r]

Hi, I would like to match a group of treated patients with an untreated group. I have about one million patients in the treatment group and ten times that in the control group. Conventional matching methods and tools can't handle this scale. I'm considering approaches such as sparse matrix matching, and I have seen packages that allow matching on larger databases, such as bigmatch and rcbalance, but I couldn't find enough documentation on how to implement them. Failing that, can someone suggest methods to manage one million treated vs. 10 million control units?
One solution I considered is forming several strata based on sex and age and then matching on comorbidity score and BMI within them. Is there a problem with this approach if it gives me good balance?
Or would writing my own matching algorithm allow me to match this many patients?

Best Answer

This is a tough problem. If you want to do 1:1 matching, this will inherently be slow. The matching would take place one treated unit at a time, and it would need to search through 10 million control units 1 million times. No optimal matching method, like the ones you mentioned in your question, will be able to handle such a large dataset. bigmatch works by shrinking the distance matrix by imposing the strictest caliper it can before the matching becomes infeasible; I have found this still takes a very long time and often doesn't work at all because the algorithm to find the caliper is slow. Nearest-neighbor matching will be faster, but it too will take a very long time.

There are ways you can speed it up, though. You can perform matching within strata of other variables, which is equivalent to exact matching on those variables. For example, if you had a "region" variable, you could do matching within each region, and you could also do exact matching within each sex-region, or within each sex-race-region, etc. The more variables you can exactly match on, the better your balance will be on those variables and the faster the matching will be.
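As a rough sketch of what this looks like with MatchIt (the dataset `d` and the covariate names here are placeholders standing in for your own variables), the `exact` argument restricts nearest-neighbor matching to cells defined by the exact-matching variables:

```r
library(MatchIt)

# Hypothetical data: 'treat' is the treatment indicator; sex, region, age,
# comorbidity, and bmi stand in for your own covariates.
m.out <- matchit(
  treat ~ age + comorbidity + bmi,
  data = d,
  method = "nearest",     # 1:1 nearest-neighbor matching
  distance = "glm",       # propensity score from logistic regression
  exact = ~ sex + region  # match only within sex-region cells
)
summary(m.out)
```

Because each treated unit is only compared against controls in its own cell, the effective search space shrinks dramatically as you add exact-matching variables.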

If you're not tied to 1:1 matching, there are other methods you can use to balance covariates. One is subclassification, in which you divide the sample into strata based on the propensity score (and optionally any other variable). Another is weighting, in which you estimate a weight for each unit based on the propensity score. Both of these methods require fitting a model for the propensity score, but there are regression and machine learning methods that can accommodate such large datasets.
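For illustration (again with placeholder variable names), subclassification is available in MatchIt via `method = "subclass"`, and propensity score weighting via the WeightIt package:

```r
library(MatchIt)
library(WeightIt)

# Subclassification: divide the sample into propensity score strata.
s.out <- matchit(
  treat ~ age + sex + comorbidity + bmi,
  data = d,
  method = "subclass",  # propensity score subclassification
  subclass = 50         # number of strata; large samples support more strata
)

# Weighting: estimate ATT weights from a logistic propensity score model.
w.out <- weightit(
  treat ~ age + sex + comorbidity + bmi,
  data = d,
  method = "glm",
  estimand = "ATT"
)
```

Both calls only require fitting a single propensity score model, so they scale far better than pairing units one at a time.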

One final option is generalized full matching, which is an extremely fast form of optimal subclassification. It was designed to work with massive datasets like yours and can complete in seconds. It is available in MatchIt by setting method = "quick" or in the quickmatch package.
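A minimal sketch of the MatchIt route (placeholder variable names again):

```r
library(MatchIt)

# Generalized full matching; designed to scale to millions of units.
g.out <- matchit(
  treat ~ age + sex + comorbidity + bmi,
  data = d,
  method = "quick",  # generalized full matching via quickmatch
  distance = "glm"
)
summary(g.out)

# Extract the matched data with weights for the outcome analysis.
md <- match.data(g.out)
```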
