Distance – Pros and Cons of Using Mahalanobis Distance Instead of Propensity Scores in Matching

distance, mahalanobis, matching, observational-study, propensity-scores

I learned about the option of using the Mahalanobis distance instead of the propensity score to do matching from the matchit() function in R. It seems like a more nonparametric approach. Could you state its pros and cons and in what situations it is suitable?

Best Answer

Mahalanobis distance matching (MDM) and propensity score matching (PSM) are methods of doing the same thing, which is to find a subset of control units similar to treated units to arrive at a balanced sample (i.e., where the distribution of covariates is the same in both groups).

MDM works by pairing units that are close based on a distance called the Mahalanobis distance, which you can think of as a scale-free Euclidean distance. For two units to have a Mahalanobis distance of 0, they must have identical covariate values. The more different the covariate values, the larger the Mahalanobis distance. The idea is that if you find control units close to the treated units on the Mahalanobis distance, each pair will have similar covariate values, and the distribution of the covariates in the treatment groups in the matched sample will be similar.
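For intuition, here is a minimal base-R sketch of the pairwise Mahalanobis distance between two units. The covariate values and the covariance used here are toy assumptions; in practice the covariance would typically be pooled within treatment groups.

```r
# Toy covariates for three units (hypothetical values)
X <- data.frame(age    = c(25, 40, 31),
                income = c(30000, 52000, 41000))
S <- cov(X)  # covariance used to scale the distance (illustrative only)

# stats::mahalanobis() returns the *squared* Mahalanobis distance of x from 'center'
d2 <- mahalanobis(x = as.numeric(X[1, ]), center = as.numeric(X[2, ]), cov = S)
sqrt(d2)  # the Mahalanobis distance; it is 0 only when the two rows are identical
```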

PSM works by pairing units that have similar propensity scores. Propensity scores reduce the entire covariate distribution into a single dimension; this means that two units with similar propensity scores will not necessarily have similar covariate values. However, because of the theoretical balancing properties of the propensity score, PSM can still yield balanced samples, even though any individual matched pair of units may not have similar covariate values.
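To make the "single dimension" point concrete, here is a rough sketch of estimating propensity scores with a simple logistic regression, using MatchIt's built-in lalonde data; the particular covariates chosen are arbitrary.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# The propensity score is the predicted probability of treatment given the covariates
ps_model <- glm(treat ~ age + educ + re74 + re75,
                data = lalonde, family = binomial)
lalonde$pscore <- predict(ps_model, type = "response")

# Units with nearly identical propensity scores can still differ substantially
# on the individual covariates
head(lalonde[order(lalonde$pscore), c("treat", "age", "educ", "re74", "pscore")])
```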

This difference between the two methods, i.e., that MDM creates pairs close on covariate values while PSM does not (even though both may be effective at yielding balanced samples), is the focus of King & Nielsen's (2019) famous critique of PSM. See the chart below, taken from the 2019 paper:

[Figure: MDM pairings (left) vs. PSM pairings (right), from King & Nielsen (2019)]

Here, we have the same dataset of treated (red) and control (blue) units, with two covariates (X1, x-axis, and X2, y-axis) being matched on. On the left, MDM is used to pair the units (each gray link is a pair), and on the right, PSM is used. You can see that with MDM, paired units have much more similar covariate values than with PSM. PSM reduces the covariate space to a single dimension, which corresponds to the diagonal line pattern in the plot on the right. Units are paired with each other because they have similar propensity scores, even though they differ quite dramatically on the covariate values.

Why does this matter? King & Nielsen argue that PSM yields fragile and non-robust estimates that can vary wildly depending on the outcome model used. In particular, if you progressively discard pairs of units that are far apart from each other (i.e., by imposing a tighter and tighter caliper), eventually balance starts to get worse with PSM, even though the remaining pairs are close on the propensity score. They call this the propensity score paradox, and it is their motivation for recommending against the use of PSM in favor of potentially more robust methods like MDM that match directly in the covariate space.

So, should we avoid PSM and stick to MDM? No. Ripollone et al. (2018) investigated the impact of the propensity score paradox on real epidemiological data. They found that while the paradox did occur with some data, it was not troublesome until extreme caliper values were used, far tighter than what would be recommended in practice. PSM generally yielded good balance on the covariates. In contrast, MDM yielded poor balance in one dataset, sometimes even worse than no matching at all. See the plot below from Ripollone et al. (2018) comparing the balance results from MDM (blue) and PSM (red and green) on one dataset as more units are pruned:

[Figure: covariate balance as more units are pruned, MDM (blue) vs. PSM (red and green), from Ripollone et al. (2018)]

The y-axis is a measure of covariate balance in the matched datasets (unrelated to the pairwise Mahalanobis distance used for matching), and the black dot is the pre-matching balance. We can see that as more units are pruned (moving right along the x-axis), balance gets worse for PSM and better for MDM, but at published recommendations for PSM calipers (vertical dashed lines), balance is excellent for PSM and poor for MDM.

How could we get such different conclusions from comparing the same methods? The answer is that it all depends on the dataset and its unique qualities, including its size, its initial balance, and the number and types of covariates to be matched on. It's worth noting that when analyzing a different dataset than the one depicted above, Ripollone et al. found MDM to yield better balance than PSM. Generally, MDM tends to work better with few covariates and covariates that are normally distributed, whereas PSM tends to work well as long as the propensity score is reasonably well estimated (because the matching is done on the propensity score, not the covariates themselves). The key, though, is that when MDM works, it really works, because it can give matched samples that are not only well balanced overall but that also contain closely paired units, whereas PSM can only promise well-balanced samples, not close pairs.

What should you do? Normally I would say try both, but this time I will just say to use genetic matching (i.e., method = "genetic" in MatchIt), which combines PSM and MDM and uses optimization to find the distance measure that provides the best balance in the matched dataset. It's much slower than MDM and PSM, but the results will be uniformly better, as many simulation studies have shown. Genetic matching was another method recommended by King & Nielsen for not succumbing to the propensity score paradox. If you can't use genetic matching (e.g., because your dataset is too big or you don't have the time to wait), then you should try both MDM and PSM and choose the one that yields the best balance, measured broadly (i.e., on pairwise covariate distances, KS statistics, and polynomials and interactions of the covariates, not just the means). It is straightforward to use MatchIt to quickly try and compare several matching methods before choosing one to move forward with for effect estimation.
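As a rough sketch of that workflow (assuming MatchIt 4.x; method = "genetic" additionally requires the Matching and rgenoud packages, and the formula just uses the covariates in the built-in lalonde data):

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

f <- treat ~ age + educ + race + married + nodegree + re74 + re75

m_psm <- matchit(f, data = lalonde, method = "nearest", distance = "glm")          # PSM
m_mdm <- matchit(f, data = lalonde, method = "nearest", distance = "mahalanobis")  # MDM
m_gen <- matchit(f, data = lalonde, method = "genetic")  # slow, but optimizes balance directly

# Compare balance broadly (standardized mean differences, eCDF statistics, etc.)
summary(m_psm)
summary(m_mdm)
summary(m_gen)
```

The cobalt package's bal.tab() and love.plot() functions can make this comparison across methods easier to read at a glance.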

There are ways of ensuring close pairs when using PSM, such as exact matching on some covariates, placing calipers on covariates directly, or doing MDM within propensity score calipers. All of these are possible in MatchIt and should be tried and compared.
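For illustration, here is a sketch of those three options, again assuming MatchIt 4.x argument names (check ?matchit for your version):

```r
library(MatchIt)
data("lalonde", package = "MatchIt")
f <- treat ~ age + educ + race + married + nodegree + re74 + re75

# 1) PSM with exact matching on one covariate
m_exact <- matchit(f, data = lalonde, method = "nearest", distance = "glm",
                   exact = ~ race)

# 2) PSM with calipers placed directly on covariates (widths in standard deviations)
m_cal <- matchit(f, data = lalonde, method = "nearest", distance = "glm",
                 caliper = c(age = 0.25, re74 = 0.25))

# 3) Mahalanobis distance matching within a propensity score caliper
m_mdm_cal <- matchit(f, data = lalonde, method = "nearest", distance = "glm",
                     mahvars = ~ age + educ + re74 + re75, caliper = 0.25)

summary(m_exact); summary(m_cal); summary(m_mdm_cal)
```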


King, G., & Nielsen, R. (2019). Why Propensity Scores Should Not Be Used for Matching. Political Analysis, 1–20. https://doi.org/10.1017/pan.2019.11

Ripollone, J. E., Huybrechts, K. F., Rothman, K. J., Ferguson, R. E., & Franklin, J. M. (2018). Implications of the Propensity Score Matching Paradox in Pharmacoepidemiology. American Journal of Epidemiology, 187(9), 1951–1961. https://doi.org/10.1093/aje/kwy078
