Standardized Mean Difference – Why Not Measure Post-Matching Balance Within Matched Treatment-Control Pairs

Tags: bias, case-control-study, matching, standardized-mean-difference, treatment-effect

I understand there are a number of techniques for evaluating post-matching balance at the covariate level: standardized mean difference (SMD), variance ratios, and empirical CDF statistics.

Are there no post-matching balance measures that calculate the average (weighted) difference in covariate values within each matched set of treated and control observations (1:1, 1:2, 1:3, etc.), rather than comparing the overall treated and control samples?

Or is it not necessary to achieve balance at this level in order to get unbiased estimates of the Average Treatment Effect on the Treated?

Best Answer

I would say one of the main reasons is that, for estimands that rely on mean potential outcomes (e.g., the difference in means, risk ratio, odds ratio), the specific arrangement of the pairs has no relation to bias. That is, if you have a matched sample and then randomly re-pair the treated and control units within it, the effect estimate after the random re-pairing will be identical to the effect estimate after the original match. The philosophy of matching as nonparametric preprocessing argues that the purpose of pairing in matching is subset selection, i.e., selecting a subset of the original sample in which balance is achieved and bias (i.e., bias due to model misspecification) is reduced. Pairing is one way to do this, but it is not the only way, and pairing by itself does not affect bias. There are a number of matching methods that do not involve pairing but are highly effective at achieving balance.
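The invariance to re-pairing can be checked numerically. The following is a minimal sketch with made-up outcome values; the variable names and the data are illustrative assumptions, not taken from any real study. It compares the mean within-pair difference under the original pairing and under a random re-pairing of the same units.

```python
import random
from statistics import mean

# Hypothetical outcomes in a 1:1 matched sample.
# treated_y[i] was originally paired with control_y[i].
treated_y = [5.1, 6.3, 4.8, 7.0, 5.9]
control_y = [4.2, 5.5, 4.9, 6.1, 5.0]

def diff_in_means(t, c):
    """Difference in group means, ignoring pair membership."""
    return mean(t) - mean(c)

def mean_pair_diff(t, c):
    """Average of the within-pair differences for a given pairing."""
    return mean(ti - ci for ti, ci in zip(t, c))

# Estimate under the original pairing
original = mean_pair_diff(treated_y, control_y)

# Break the pairs: shuffle which control unit goes with which treated unit
rng = random.Random(0)
shuffled_control = control_y[:]
rng.shuffle(shuffled_control)
repaired = mean_pair_diff(treated_y, shuffled_control)

# Both equal the plain difference in means (up to floating-point rounding)
print(original, repaired, diff_in_means(treated_y, control_y))
```

The point estimate depends only on which units are in the matched sample, not on how they are paired. Pair membership can still matter for the standard error.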

That said, pairwise balance is relevant to the overall bias in a certain way. Rather than thinking about pairwise balance as a property of a given pairing, it is useful to think of the best pairwise balance that could be achieved by any possible pairing in a given matched sample. For example, imagine first that 1:1 matching was done with exact matching on age, so each pair contains units that have equal ages but may differ on other variables. After matching, the distributions of age will be identical between the two treatment groups. Let's say that you now randomly re-pair the units in the matched sample, breaking the original pairs so that age is no longer exactly matched within pairs. This does not change the distributions of age in the matched sample; they will still be identical. Similarly, if it were possible to exactly match on education without discarding any units from this matched sample, that would indicate that the distributions of education were identical, even if the units were not actually matched on education. Again, the pairwise balance of a given pairing is less important than the best possible pairwise balance a matched sample could have under a hypothetical pairing. The closer the best possible pairing is to exact matching, the better the distributional balance of the covariate, and the better the overall balance that has been attained, regardless of the pairing actually used to create the matched sample or estimate the treatment effect.

The idea of assessing pairwise balance has been discussed by some methodologists in the matching literature. For example, Rubin (1973) recommends the use of two balance statistics to evaluate the quality of a match: $$ \bar d^1=\bar x_1 - \bar x_0 $$ and $$ \bar d^2=\frac{1}{N}\sum_{i=1}^{N} (x_{1i} - x_{0i})^2 $$ where $N$ is the number of matched pairs, the former is the difference in means, and the latter is the average squared pairwise difference. Similarly, the criterion minimized in optimal matching is $\sum_i d_i$, where $d_i$ is the distance between the two units in pair $i$, equal to $|x_{1i} - x_{0i}|$ when the distance variable $x$ is univariate (e.g., in propensity score matching). Though not strictly a balance statistic, a failure to achieve small pairwise differences in the distance measure indicates a failure of the matching to achieve balance. The MatchIt package in R reports a statistic of this kind, the average absolute within-pair difference, for each covariate when a pair matching method is used (it appears as "Std. Pair Dist." in the balance summary).
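As a small worked example of these statistics, here is a sketch in Python using made-up covariate values for four matched pairs. The function names are hypothetical helpers written for illustration, not part of any package. It also shows that re-pairing leaves $\bar d^1$ unchanged while $\bar d^2$ depends on the pairing.

```python
from statistics import mean

# Hypothetical covariate values for N = 4 matched pairs;
# x_treated[i] is paired with x_control[i].
x_treated = [30.0, 45.0, 52.0, 61.0]
x_control = [31.0, 43.0, 52.0, 64.0]

def rubin_d1(x1, x0):
    """Difference in group means (does not depend on the pairing)."""
    return mean(x1) - mean(x0)

def rubin_d2(x1, x0):
    """Average squared within-pair difference (depends on the pairing)."""
    return mean((a - b) ** 2 for a, b in zip(x1, x0))

def mean_abs_pair_distance(x1, x0):
    """Average absolute within-pair difference."""
    return mean(abs(a - b) for a, b in zip(x1, x0))

d1 = rubin_d1(x_treated, x_control)                    # -0.5
d2 = rubin_d2(x_treated, x_control)                    # 3.5
mad = mean_abs_pair_distance(x_treated, x_control)     # 1.5

# A deliberately bad re-pairing: reverse the order of the controls
x_control_reversed = x_control[::-1]
d1_bad = rubin_d1(x_treated, x_control_reversed)       # still -0.5
d2_bad = rubin_d2(x_treated, x_control_reversed)       # 546.5
```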

A more complete way to assess balance would be to perform optimal matching within the matched sample using a different variable or set of variables to compute the distance measure, and see how good the best achievable balance is, rather than rely on the pairwise differences from the specific matching specification used to subset the data. For example, after doing matching to subset the data, you can run optimal matching in the matched dataset without discarding any additional units, using a different variable as the matching variable. If the average pair distance on that variable is 0, then the sample is exactly balanced on that variable, even if the original pairing did not yield such closely matched pairs. Similarly, if you take two variables and use them to compute a distance measure (e.g., the Mahalanobis distance), then pair match on that measure in the matched sample, an average pairwise distance of 0 indicates that the groups are exactly balanced on both variables and their interaction (i.e., on the joint distribution of those covariates), which is an even stronger form of balance, even if the original pairing did not pair units so closely. This is a bit of a laborious process, especially for many combinations of covariates, but it would give a far more complete picture of balance beyond mean differences and even beyond univariate distribution statistics like the Kolmogorov-Smirnov statistic.
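For a single covariate, the re-pairing check described above can be sketched without a matching package. The sketch below relies on an assumption worth stating loudly: with equal numbers of treated and control units and absolute difference as the distance, pairing the sorted treated values with the sorted control values minimizes the total pair distance. The data and the helper name are made up for illustration. For a multivariate distance such as the Mahalanobis distance, sorting no longer applies and a general assignment solver (for example, `scipy.optimize.linear_sum_assignment`) would be needed instead.

```python
from statistics import mean

def best_mean_pair_distance(x1, x0):
    """Smallest achievable mean |x1 - x0| over all 1:1 pairings.

    Assumes equal group sizes and a single covariate, in which case
    pairing sorted values with sorted values is optimal.
    """
    if len(x1) != len(x0):
        raise ValueError("this sketch assumes equal group sizes")
    return mean(abs(a - b) for a, b in zip(sorted(x1), sorted(x0)))

# Hypothetical years of education in a matched sample whose pairs
# were formed on some other variable (e.g., a propensity score).
educ_treated = [12, 16, 14, 18, 12]
educ_control = [18, 12, 12, 14, 16]

# Pair distance under the pairing actually used
as_paired = mean(abs(a - b) for a, b in zip(educ_treated, educ_control))

# Best pair distance achievable by re-pairing the same units
best = best_mean_pair_distance(educ_treated, educ_control)

print(as_paired, best)  # 4 0
```

Here the pairs as formed differ by four years of education on average, yet the best possible re-pairing gives a distance of 0, which means the two groups have identical education distributions in this matched sample.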

There are issues beyond bias worth considering. Having close pairs decreases the estimated standard error when pair membership is accounted for in estimating the treatment effect. It is also possible for close pairs to reduce sensitivity to unobserved confounding, but only when using somewhat arcane methods to estimate the treatment effect, as described in Zubizarreta et al. (2014). In these cases, it makes sense to make pairwise distances as small as possible on covariates that are highly predictive of the outcome.


Rubin, D. B. (1973). Matching to Remove Bias in Observational Studies. Biometrics, 29(1), 159–183. https://doi.org/10.2307/2529684

Zubizarreta, J. R., Paredes, R. D., & Rosenbaum, P. R. (2014). Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in Chile. The Annals of Applied Statistics, 8(1), 204–231. https://doi.org/10.1214/13-AOAS713