I'm going to give you three answers to this question, even though one is enough. In summary, don't use propensity score adjustment. It consistently performs worse than other propensity score methods and offers few, if any, benefits over traditional regression.
The first answer is that you can't. Matching is a "design-based" method, meaning the sample is adjusted without reference to the outcome, similar to the design of a randomized trial. Here, you can assess balance in the sample in a straightforward way by comparing the distributions of covariates between the groups in the matched sample just as you could in the unmatched sample. In contrast, propensity score adjustment is an "analysis-based" method, just like regression adjustment; the sample itself is left intact, and the adjustment occurs through the model. In the same way you can't* assess how well regression adjustment is doing at removing bias due to imbalance, you can't* assess how well propensity score adjustment is doing at removing bias due to imbalance, because as soon as you've fit the model, a treatment effect is estimated and yet the sample is unchanged. Indeed, this is an epistemic weakness of these methods; you can't assess the degree to which confounding due to the measured covariates has been reduced when using regression. Therefore, matching in combination with rigorous balance assessment should be used if your goal is to convince readers that you have truly eliminated substantial bias in the estimate.
The second answer is that Austin (2008) developed a method for assessing balance on covariates when conditioning on the propensity score. The method is as follows:
- Fit a regression model of the covariate on the treatment, the propensity score, and their interaction
- Generate predicted values under treatment and under control for each unit from this model
- Subtract the means of these values
- Divide by the estimated residual standard deviation (if the covariate is continuous) or by a standard deviation computed from the predicted probabilities (if the covariate is binary)
This is equivalent to performing g-computation to estimate the effect of the treatment on the covariate, adjusting only for the propensity score. If, conditional on the propensity score, there is no association between the treatment and the covariate, then the covariate would no longer induce confounding bias in the propensity score-adjusted outcome model. Of course, this method only tests for mean differences in the covariate, but applying it to other transformations of the covariate can paint a broader picture of balance for that covariate. Though this methodology is intuitive, there is no empirical evidence supporting its use, and there will always be scenarios where it fails to capture relevant imbalance on the covariates. It also requires a specific correspondence between the outcome model and the models for the covariates, but those models might not be expected to be similar at all (e.g., if they involve different model forms or different assumptions about effect heterogeneity).
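To make the steps concrete, here is a minimal sketch in R for a continuous covariate, using simulated data. This is my own illustration rather than Austin's code, and the object names (`dat`, `treat`, `ps`, `x1`) are hypothetical stand-ins for your own treatment, estimated propensity score, and covariate.

```r
# Minimal sketch of Austin's balance check for a continuous covariate
# (simulated data; all object names are hypothetical)
set.seed(123)
n     <- 500
x1    <- rnorm(n)                                   # covariate to assess
treat <- rbinom(n, 1, plogis(0.5 * x1))             # treatment depends on x1
ps    <- glm(treat ~ x1, family = binomial)$fitted.values  # estimated propensity score
dat   <- data.frame(treat, ps, x1)

# 1. Regress the covariate on treatment, the propensity score, and their interaction
fit <- lm(x1 ~ treat * ps, data = dat)

# 2. Predicted covariate values for each unit under treatment and under control
pred1 <- predict(fit, newdata = transform(dat, treat = 1))
pred0 <- predict(fit, newdata = transform(dat, treat = 0))

# 3. and 4. Difference in means, standardized by the residual standard deviation
(mean(pred1) - mean(pred0)) / sigma(fit)

# For a binary covariate, fit a logistic model instead and standardize using a
# standard deviation computed from the predicted probabilities
```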
The third answer relies on a recent development: the "implied" weights of linear regression for estimating the effect of a binary treatment, described by Chattopadhyay and Zubizarreta (2021). Basically, a regression of the outcome on the treatment and covariates is equivalent to a weighted difference in mean outcomes between the treated and control groups, where the weights take a specific form determined by the regression model. These weights often include negative values, which makes them different from traditional propensity score weights, but they are conceptually similar otherwise. In theory, you could use these weights to compute weighted balance statistics just as you would with propensity score weights. Your outcome model would, of course, be the regression of the outcome on the treatment and the propensity score. From that model, you could compute the weights and then compute standardized mean differences and other balance measures. All of this assumes you are fitting a linear regression model for the outcome. As this is a recently developed methodology, its properties and effectiveness have not been empirically examined, but it has a stronger theoretical basis than Austin's method and allows for a more flexible balance assessment.
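To illustrate the general idea, here is my own sketch of computing implied weights for the coefficient on a binary treatment and plugging them into a balance statistic; it is not the authors' implementation, the data are simulated, and all names are hypothetical.

```r
# Sketch: implied regression weights for the coefficient on a binary treatment,
# then a weighted standardized mean difference as a balance check
# (my own illustration of the general idea, not the authors' implementation)
set.seed(123)
n     <- 500
x1    <- rnorm(n)
treat <- rbinom(n, 1, plogis(0.5 * x1))
ps    <- glm(treat ~ x1, family = binomial)$fitted.values
y     <- x1 + treat + rnorm(n)
dat   <- data.frame(y, treat, ps, x1)

# Outcome model: regression of the outcome on the treatment and the propensity score
X <- model.matrix(~ treat + ps, data = dat)

# The OLS coefficient on `treat` is a linear combination of the outcomes:
# beta_treat = sum_i c_i * y_i, where c is the "treat" row of (X'X)^{-1} X'
c_vec <- (solve(crossprod(X)) %*% t(X))["treat", ]

# Implied weights: c_i for treated units, -c_i for controls (some may be negative),
# so that beta_treat = sum(w * y, treated) - sum(w * y, control)
w <- ifelse(dat$treat == 1, c_vec, -c_vec)

# Check: the weighted mean difference reproduces the regression coefficient
unname(coef(lm(y ~ treat + ps, data = dat))["treat"])
sum(w[dat$treat == 1] * dat$y[dat$treat == 1]) -
  sum(w[dat$treat == 0] * dat$y[dat$treat == 0])

# Weighted standardized mean difference for the covariate x1
wmean <- function(x, w) sum(w * x) / sum(w)
(wmean(dat$x1[dat$treat == 1], w[dat$treat == 1]) -
    wmean(dat$x1[dat$treat == 0], w[dat$treat == 0])) / sd(dat$x1)
```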
What should you do? Don't use propensity score adjustment except as part of a more sophisticated doubly-robust method. If you want to prove to readers that you have eliminated the association between the treatment and covariates in your sample, then use matching or weighting. If you want to rely on the theoretical properties of the propensity score in a robust outcome model, then use a flexible and doubly-robust method like g-computation with the propensity score as one of many covariates or targeted maximum likelihood estimation (TMLE).
I would say one of the main reasons is that, for estimands that rely on mean potential outcomes (e.g., the difference in means, risk ratio, odds ratio), the specific arrangement of the pairs has no bearing on bias. That is, if you take a matched sample and randomly re-pair the treated and control units, the effect estimate after the random pairing will be identical to the effect estimate under the original pairing. The philosophy of matching as nonparametric preprocessing holds that the purpose of pairing in matching is subset selection, i.e., selecting a subset of the original sample in which balance is achieved and bias (i.e., model misspecification) is reduced. Pairing is one way to do this, but it is not the only way, and by itself it does not affect bias. There are a number of matching methods that do not involve pairing but are highly effective at achieving balance.
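A trivial demonstration of this point, with a hypothetical simulated matched sample: the difference in means never references the pairing, so re-pairing cannot change it.

```r
# Toy demonstration: in a matched sample, the difference in means does not
# depend on how the units are paired (hypothetical simulated data)
set.seed(42)
matched <- data.frame(treat = rep(c(1, 0), each = 50),
                      y     = rnorm(100),
                      pair  = c(1:50, 1:50))   # original pairing

est_original <- with(matched, mean(y[treat == 1]) - mean(y[treat == 0]))

# Randomly re-pair the control units with the treated units
matched$pair[matched$treat == 0] <- sample(1:50)

est_repaired <- with(matched, mean(y[treat == 1]) - mean(y[treat == 0]))

identical(est_original, est_repaired)  # TRUE: the pairing never enters the estimate
```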
That said, pairwise balance is relevant to the overall bias in a certain way. Rather than thinking about pairwise balance as a property of a given pairing, it is useful to think of the best pairwise balance that could be achieved by any possible pairing of a given matched sample. For example, imagine that 1:1 matching was done with exact matching on age, so each pair contains units with equal ages that may differ on other variables. After matching, the distributions of age will be identical between the two treatment groups. Say you now randomly re-pair the units in the matched sample, breaking the original pairs so that age is no longer exactly matched within pairs. This does not change the distributions of age in the matched sample; they will still be identical. Similarly, if it were possible to exactly match on education without discarding any units from this matched sample, that would indicate that the distributions of education were identical, even if the units were not actually paired on education. Again, the pairwise balance of a given pairing matters less than the best possible pairwise balance a matched sample could have under a hypothetical pairing. The closer the best possible pairing is to exact matching, the better the distributional balance of the covariate, and the better the overall balance that has been attained, regardless of the pairing actually used to create the matched sample or estimate the treatment effect.
The idea of assessing pairwise balance has been discussed by some methodologists in the matching literature. For example, Rubin (1973) recommends the use of two balance statistics to evaluate the quality of a match:
$$
\bar d^1=\bar x_1 - \bar x_0
$$
and
$$
\bar d^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{1i} - x_{0i})^2
$$
where the former is the difference in covariate means and the latter is the average squared pairwise difference, with $N$ the number of pairs. Similarly, the measure used as the criterion in optimal matching is $\sum_i d_i$, where $d_i$ is the distance between the two units in pair $i$, equal to $|x_{1i} - x_{0i}|$ when the distance variable $x$ is univariate (e.g., when matching on the propensity score). Though not strictly a balance statistic, a failure to achieve small pairwise differences in the distance measure indicates a failure of the matching to achieve balance. The MatchIt package in R produces this statistic for each covariate when any pair matching method is used.
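For example, here is a rough sketch of computing $\bar d^1$ and $\bar d^2$ for one covariate in a 1:1 nearest-neighbor matched sample built with MatchIt, using the lalonde dataset that ships with the package; the covariates chosen are arbitrary.

```r
# Sketch: Rubin's (1973) pairwise balance statistics for one covariate (age)
# in a 1:1 nearest-neighbor matched sample (requires the MatchIt package)
library(MatchIt)
data("lalonde", package = "MatchIt")

m.out <- matchit(treat ~ age + educ + re74, data = lalonde, method = "nearest")
md <- match.data(m.out)

# Align each pair's treated and control values of age by pair membership
x1 <- md$age[md$treat == 1][order(md$subclass[md$treat == 1])]
x0 <- md$age[md$treat == 0][order(md$subclass[md$treat == 0])]

c(d1 = mean(x1) - mean(x0),     # difference in means
  d2 = mean((x1 - x0)^2))       # average squared pairwise difference

summary(m.out)  # also reports a standardized pairwise distance for each covariate
```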
A more complete way to assess balance would be to perform optimal matching within a matched sample, using a different variable or set of variables to compute the distance measure, and see how good the best achievable balance is, rather than relying on the pairwise differences of the specific matching specification used to subset the data. For example, after matching to subset the data, you can run optimal matching in the matched dataset without discarding any additional units, using a different variable as the matching variable. If the average pair distance on that variable is 0, then the sample is exactly balanced on it, even if the original pairing did not yield such closely matched pairs. Similarly, if you take two variables, use them to compute a distance measure (e.g., the Mahalanobis distance), and then pair match on that measure within the matched sample, an average pairwise distance of 0 indicates that the groups are exactly matched on both variables and their interaction (i.e., on the joint distribution of those covariates), which is an even stronger form of balance, even if in the original sample they were not so closely paired. This is a bit of a laborious process, especially for many combinations of covariates, but it would give a far more complete picture of balance than mean differences and even than univariate distributional statistics like the Kolmogorov-Smirnov statistic.
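Continuing the sketch above, one hypothetical way to do this with MatchIt is to optimally re-pair the already matched sample using a different covariate as the matching variable and inspect the resulting average pair distance. This assumes the optmatch package is installed and that supplying a numeric vector as the distance measure behaves as I expect; treat it as a sketch rather than a vetted recipe.

```r
# Sketch (continuing from the previous block): re-pair the matched sample on a
# different variable and see how close the best pairing can get
# (optimal matching requires the optmatch package)
md2 <- lalonde[m.out$weights > 0, ]   # the matched subset, without added columns

# Optimal pair matching within the matched sample, using educ as the matching
# variable (supplied as a numeric distance measure)
m.re <- matchit(treat ~ educ, data = md2, method = "optimal",
                distance = md2$educ)

# Average absolute within-pair difference in educ under the best re-pairing;
# 0 would indicate the matched sample can be exactly paired (balanced) on educ
re.md <- match.data(m.re)
mean(tapply(re.md$educ, re.md$subclass, function(x) abs(diff(x))))
```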
There are issues beyond bias worth considering. Having close pairs decreases the standard error estimate when pair membership is accounted for in estimating the treatment effect. It is also possible for close pairs to reduce sensitivity to unobserved confounding, but only when using somewhat arcane methods to estimate the treatment effect, as described in Zubizarreta et al. (2014). In these cases, it makes sense to achieve pairwise distances as low as possible on covariates that are highly predictive of the outcome.
Rubin, D. B. (1973). Matching to Remove Bias in Observational Studies. Biometrics, 29(1), 159–183. https://doi.org/10.2307/2529684
Zubizarreta, J. R., Paredes, R. D., & Rosenbaum, P. R. (2014). Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in Chile. The Annals of Applied Statistics, 8(1), 204–231. https://doi.org/10.1214/13-AOAS713
This is a great question but kind of an impossible task. It points to the fundamental bias-variance tradeoff that is omnipresent in statistics and in causal effect estimation in particular. Technique A will probably have lower variance and higher bias, and Technique B will probably have higher variance and lower bias. The question of which has lower mean squared error, which is in a sense the fundamental question for deciding between them, cannot be answered without knowing more about the data-generating process than a researcher has access to.
Here is one way you could proceed. First, run a power analysis to determine the sample size required to detect an effect of interest with a desired level of power, or run an analysis to determine the sample size required for a confidence interval of a given width. Then, see if Technique B yields a matched sample greater than or equal to that size. If it does, your bias will be low and you will have the desired precision. If it does not, you can still proceed with Technique B, but know that you are at risk of a wide confidence interval or of making a Type II error (false negative). You may also conclude that there is no way to reliably detect the effect given the data, because the only way to reduce bias in the effect estimate is to decrease precision to an unacceptable degree. That is a fundamental limitation of the dataset and is tantamount to running a randomized trial that is too small to detect an effect.
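For the power-analysis step, here is a minimal sketch; the 0.3 standard-deviation effect size and the continuous outcome are assumptions purely for illustration.

```r
# Sketch: sample size needed to detect an assumed effect of 0.3 SD with 80% power,
# to compare against the size of the matched sample produced by Technique B
pwr <- power.t.test(delta = 0.3, sd = 1, power = 0.80, sig.level = 0.05)
ceiling(pwr$n)   # required number of units per group

# e.g., compare to the number of matched treated units in a hypothetical
# matched sample `md`:
# sum(md$treat == 1) >= ceiling(pwr$n)
```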
Another option is to augment the matching with further bias reduction through regression. That is, you can use Technique A and then further adjust the effect estimate by including the covariates (in particular, the imbalanced ones) in the outcome model. This still leaves you open to all the problems of using regression alone, including extrapolation and the inability to prove that you have achieved adequate balance*, but to a lesser degree, since the matching has at least partially reduced the model dependence.
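A rough sketch of this workflow with MatchIt, using the lalonde data as a placeholder; the outcome re78, the covariates, and the caliper are arbitrary choices for illustration.

```r
# Sketch: matching followed by regression adjustment in the matched sample,
# including the covariates (especially imbalanced ones) in the outcome model
library(MatchIt)
data("lalonde", package = "MatchIt")

m.adj <- matchit(treat ~ age + educ + re74, data = lalonde,
                 method = "nearest", caliper = 0.2)   # stand-in for Technique A
md.adj <- match.data(m.adj)

fit <- lm(re78 ~ treat + age + educ + re74, data = md.adj, weights = weights)
coef(fit)["treat"]

# Standard errors should account for pair membership and the matching weights
# (e.g., cluster-robust SEs on md.adj$subclass via the sandwich package); not shown
```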
There is a way to directly visualize the bias-variance tradeoff using a method called the "matching frontier," described in King et al. (2017) and implemented in my R package MatchingFrontier, which isn't yet on CRAN. The matching frontier is a function that relates the size of the matched sample to the (optimal) balance of that sample. This allows you to see how continuing to discard units (e.g., by tightening a caliper) changes balance. It might be that there is a caliper at which balance stops improving, in which case you can use a wider caliper than the one you have been using. You can also estimate treatment effects and confidence intervals across the frontier to see how the effect estimate and confidence interval change as additional units are dropped. You would present the entire frontier to readers so as not to cherry-pick the point on the frontier that yields the most favorable result.

*The methodology described in Chattopadhyay & Zubizarreta (2022) actually does allow you to assess balance after linear regression in a matched or unmatched sample. An R package that implements these methods is coming out soon; if you are interested in using it, get in touch.