Propensity Scores – How to Evaluate Success of Propensity Score Matching with a Single Metric

covariate-balance, matching, propensity-scores, standardized-mean-difference

Propensity score matching techniques can be assessed and compared with covariate balance metrics like the standardized mean difference (SMD).
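For a single covariate, the SMD is just the difference in group means divided by a pooled standard deviation. A minimal sketch in R, assuming a hypothetical data frame `dat` with a binary `treat` indicator and a covariate `x1`:

```r
# Standardized mean difference for one covariate: difference in group
# means divided by a pooled standard deviation (hypothetical data frame
# `dat` with binary `treat` and covariate `x1`).
smd <- function(x, treat) {
  m1 <- mean(x[treat == 1])
  m0 <- mean(x[treat == 0])
  s_pooled <- sqrt((var(x[treat == 1]) + var(x[treat == 0])) / 2)
  (m1 - m0) / s_pooled
}

smd(dat$x1, dat$treat)  # compute in the unmatched and matched samples to compare
```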

However, SMDs don't account for varying matching rates.

For example, how would you compare Matching Technique A, which achieves a 95% matched rate but poor covariate balance (SMDs in the 0.1 to 0.5 range), with Matching Technique B, which has a 70% matched rate and SMDs all ~0.0?

Best Answer

This is a great question but, in a sense, an impossible task. It points to the fundamental bias-variance tradeoff that is omnipresent in statistics and in causal effect estimation in particular. Technique A will probably have lower variance and higher bias, and Technique B will probably have higher variance and lower bias. The question of which has lower mean squared error (MSE = bias² + variance), which is ultimately the question that would indicate which technique to choose, cannot be answered without knowing more about the data-generating process than a researcher has access to.

Here is one way you could proceed. First, run a power analysis to determine the sample size required to detect an effect of interest with the desired power, or run an analysis to determine the sample size required to achieve a confidence interval of a given width. Then, see whether Technique B yields a matched sample at least that large. If it does, your bias will be low and you will have the desired precision. If it does not, you can still proceed with Technique B, but know that you are at risk of a wide confidence interval or of making a type II error (false negative). You may also come to the conclusion that there is no way to reliably detect the effect given the data, because the only way to reduce bias in the effect estimate is to decrease precision to an unacceptable degree. That is a fundamental limitation of the dataset and is tantamount to running a randomized trial that is too small to detect an effect.
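As a rough sketch of that first step in R, using base R's power.t.test() and assuming (hypothetically) a standardized effect size of 0.2 SD, 80% power, and a 5% two-sided test:

```r
# Required per-group sample size to detect a (hypothetical) effect of
# 0.2 SD with 80% power at alpha = 0.05.
pwr <- power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.8)
n_required <- ceiling(pwr$n)
n_required

# Compare this target against the size of Technique B's matched sample
# (e.g., the number of matched pairs it retains). If the matched sample
# meets or exceeds the target, the units discarded by Technique B are
# unlikely to cost you the precision you need.
```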

Another option is to augment the matching with further bias reduction through regression. That is, you can use Technique A and then further adjust the effect estimate by including the covariates (in particular, the imbalanced ones) in the outcome model. This still leaves you open to all the problems that using regression alone has, including extrapolation and the inability to demonstrate that you have achieved adequate balance*, but to a lesser degree, since the matching has at least partially reduced model dependence.
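A minimal sketch of this match-then-adjust approach in R, assuming the MatchIt package and hypothetical variable names (`treat`, `x1`–`x3`, `y`):

```r
library(MatchIt)

# Step 1: a matching analogous to Technique A (nearest-neighbor
# propensity score matching; hypothetical formula and data).
mA <- matchit(treat ~ x1 + x2 + x3, data = dat, method = "nearest")
md <- match.data(mA)  # matched sample, with a `weights` column

# Step 2: regression adjustment in the matched sample, including the
# covariates (especially any that remain imbalanced) in the outcome
# model to reduce residual bias.
fit <- lm(y ~ treat + x1 + x2 + x3, data = md, weights = weights)
summary(fit)
```

In practice you would also want standard errors that account for the matching (e.g., cluster-robust standard errors by matched pair), but the point here is just the additional covariate adjustment after matching.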

There is a way to directly visualize the bias-variance tradeoff using a method called the "matching frontier", which is described in King et al. (2017) and implemented in my R package MatchingFrontier, which isn't yet on CRAN. The matching frontier is a function relating the size of the matched sample to the (optimal) balance achievable in a sample of that size. This lets you see how continuing to discard units (e.g., by tightening a caliper) changes balance. It might be that there is a caliper beyond which balance stops improving, in which case you can use a wider caliper than the one you have been using. You can also estimate treatment effects and confidence intervals across the frontier to see how the effect estimate and confidence interval change as additional units are dropped. You would present the entire frontier to readers so as not to cherry-pick the point on the frontier that yields the most favorable result.
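Since MatchingFrontier isn't on CRAN yet, here is a rough approximation of the idea (not the package's API or the exact frontier algorithm): sweep a caliper from wide to narrow with MatchIt and record the matched sample size and worst covariate SMD at each value, using the same hypothetical variable names as above.

```r
library(MatchIt)

# Trace how balance and matched sample size change as the caliper
# tightens (a crude stand-in for the optimal matching frontier).
calipers <- seq(0.5, 0.05, by = -0.05)
trace <- do.call(rbind, lapply(calipers, function(cal) {
  m  <- matchit(treat ~ x1 + x2 + x3, data = dat,
                method = "nearest", caliper = cal)
  md <- match.data(m)
  smds <- sapply(c("x1", "x2", "x3"), function(v) {
    x <- md[[v]]; tr <- md$treat
    abs(mean(x[tr == 1]) - mean(x[tr == 0])) /
      sqrt((var(x[tr == 1]) + var(x[tr == 0])) / 2)
  })
  data.frame(caliper = cal, n_matched = nrow(md), worst_smd = max(smds))
}))
trace  # report the whole trace rather than a single favorable point
```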


* The methodology described in Chattopadhyay & Zubizarreta (2022) actually does allow you to assess balance after linear regression in a matched or unmatched sample. We have an R package implementing the methods coming out soon; if you are interested in using it, get in touch.