I'm going to give you three answers to this question, even though one is enough. In summary, don't use propensity score adjustment. It consistently performs worse than other propensity score methods and adds few, if any, benefits over traditional regression.
The first answer is that you can't. Matching is a "design-based" method, meaning the sample is adjusted without reference to the outcome, similar to the design of a randomized trial. Here, you can assess balance in a straightforward way by comparing the distributions of covariates between the groups in the matched sample, just as you could in the unmatched sample. In contrast, propensity score adjustment is an "analysis-based" method, just like regression adjustment: the sample itself is left intact, and the adjustment occurs through the model. In the same way you can't assess how well regression adjustment is doing at removing bias due to imbalance, you can't assess how well propensity score adjustment is doing, because as soon as you've fit the model, a treatment effect is estimated and yet the sample is unchanged. Indeed, this is an epistemic weakness of these methods: you cannot assess the degree to which confounding due to the measured covariates has been reduced when using regression. Therefore, if your goal is to convince readers that you have truly eliminated substantial bias in the estimate, use matching in combination with rigorous balance assessment.
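As a rough illustration, the design-based workflow might look like the following in R (variable and data frame names are purely illustrative); note that balance is assessed before the outcome is ever modeled:

```r
# Sketch of the design-based workflow: match, then assess balance,
# all without reference to the outcome. Names are illustrative.
library(MatchIt)
library(cobalt)

m.out <- matchit(treat ~ age + educ + race, data = dat,
                 method = "nearest")

bal.tab(m.out)    # covariate balance in the matched sample
love.plot(m.out)  # graphical balance summary
```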
The second answer is that Austin (2008) developed a method for assessing balance on covariates when conditioning on the propensity score. The method is as follows:
- Fit a regression model of the covariate on the treatment, the propensity score, and their interaction
- Generate predicted values under treatment and under control for each unit from this model
- Subtract the means of these values
- Divide by the estimated residual standard deviation (if the covariate is continuous) or a standard deviation computed from the predicted probabilities (if the covariate is binary)
This is equivalent to performing g-computation to estimate the effect of the treatment on the covariate, adjusting only for the propensity score. If, conditional on the propensity score, there is no association between the treatment and the covariate, then the covariate would no longer induce confounding bias in the propensity score-adjusted outcome model. Of course, this method only tests for mean differences in the covariate, but applying it to transformations of the covariate (e.g., its square) can paint a more holistic picture of balance for that covariate. Though this methodology is intuitive, there is no empirical evidence for its use, and there will always be scenarios in which it fails to capture relevant imbalance on the covariates. It also requires a specific correspondence between the outcome model and the models for the covariates, but those models might not be expected to be similar at all (e.g., if they involve different model forms or different assumptions about effect heterogeneity).
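A rough sketch of the steps above in R, assuming a data frame `dat` with a numeric 0/1 treatment `A`, a continuous covariate `X`, and an estimated propensity score `ps` (all names illustrative):

```r
# Austin's (2008) balance check for one continuous covariate.
# Fit the covariate on treatment, propensity score, and interaction.
fit <- lm(X ~ A * ps, data = dat)

# Predicted covariate values for each unit under treatment and control
p1 <- predict(fit, transform(dat, A = 1))
p0 <- predict(fit, transform(dat, A = 0))

# Standardized difference: mean difference over the residual SD
(mean(p1) - mean(p0)) / sigma(fit)
```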
The third answer relies on a recent discovery: the "implied" weights of linear regression for estimating the effect of a binary treatment, as described by Chattopadhyay and Zubizarreta (2021). Essentially, a regression of the outcome on the treatment and covariates is equivalent to a weighted difference in mean outcomes between the treated and control groups, where the weights take a specific form determined by the regression model. These weights often include negative values, which distinguishes them from traditional propensity score weights, but they are conceptually similar otherwise. In theory, you could use these weights to compute weighted balance statistics just as you would with propensity score weights. Your outcome model would, of course, be the regression of the outcome on the treatment and the propensity score; from that model, you could compute the weights and then compute standardized mean differences and other balance measures. All of this assumes you are fitting a linear regression model for the outcome. As this is a recently developed methodology, its properties and effectiveness have not been empirically examined, but it has a stronger theoretical basis than Austin's method and allows for a more flexible balance assessment.
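If I understand its interface correctly, the `lmw` package (by Chattopadhyay, Greifer, and Zubizarreta) implements these implied regression weights; a hedged sketch, again with `dat`, `A`, and `ps` as illustrative names:

```r
# Sketch: implied weights of the regression of the outcome on the
# treatment A and the propensity score ps; names are illustrative.
library(lmw)

w <- lmw(~ A + ps, data = dat, treat = "A", estimand = "ATT")
summary(w)  # balance statistics computed from the implied weights
```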
What should you do? Don't use propensity score adjustment except as part of a more sophisticated doubly-robust method. If you want to prove to readers that you have eliminated the association between the treatment and covariates in your sample, then use matching or weighting. If you want to rely on the theoretical properties of the propensity score in a robust outcome model, then use a flexible and doubly-robust method like g-computation with the propensity score as one of many covariates or targeted maximum likelihood estimation (TMLE).
There are two reasons why these values differ. The pre-matching values differ because of how `cobalt` and `tableone` compute the denominator of the standardized mean difference. `tableone` uses $\sqrt{\frac{s_1^2 + s_0^2}{2}}$ in the denominator of the SMD, whereas `cobalt` uses $s_1$ (where $s_1$ and $s_0$ are the standard deviations of the covariate in the treated and control groups, respectively). This option can be changed in `cobalt`; you can set `s.d.denom = "pooled"` to use the `tableone` version. `cobalt` chooses the default standardization factor based on the estimand supplied to `matchit()`, which in this case is the ATT; the ATT implies the treated group is the target population, so the standardization factor should reflect the treated group's standard deviation. See my answer here for some information on that choice. In the end, it doesn't matter too much, and results usually won't differ unless the variances of the two groups are severely imbalanced.
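For example, you can compare the two standardization factors directly in `cobalt`; a sketch assuming `m.out` is your `matchit` object:

```r
# Comparing standardization factors for the SMD in cobalt.
library(cobalt)

bal.tab(m.out, stats = "mean.diffs",
        s.d.denom = "treated")  # cobalt's default for the ATT
bal.tab(m.out, stats = "mean.diffs",
        s.d.denom = "pooled")   # matches tableone's SMD
```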
The reason the two results differ after matching is that you failed to include the matching weights in the balance statistics for `tableone`. Because you did 4:1 matching with a caliper, not all treated units received 4 matches; some received 3, some 2, some 1, and some none at all. In this case, matched control units receive different weights depending on how many other control units were matched to the same treated unit. For example, if a treated unit received only one matched control unit (because all others were outside the caliper or had already been matched), that control unit receives a weight of 1, but if a treated unit received four matched control units, each receives a weight of 1/4. The weights are necessary both for assessing balance and for estimating the treatment effect. `cobalt` automatically extracts the weights from the `matchit` object and includes them in computing the SMD; `tableone` does not unless you supply the weights manually using `svyCreateTableOne()`. Even if you use `svyCreateTableOne()`, the SMDs will not be calculated correctly because they will use the weighted variance in the calculations, which is inappropriate. See my answer here for more detail about that.
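A sketch of how the matching weights can be inspected and used, assuming `m.out` is your `matchit` object and `treat` is the treatment variable:

```r
# Inspecting and using the matching weights from a matchit object.
library(MatchIt)
library(cobalt)

md <- match.data(m.out)             # matched sample with a `weights` column
summary(md$weights[md$treat == 0])  # control weights vary with match count

bal.tab(m.out)  # weights are extracted and applied automatically
```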
You should use `cobalt` for assessing balance. `tableone` is great for making nice tables, but not as much care has been put into ensuring balance statistics are computed correctly and consistently across a variety of circumstances, because that is not what the package was designed for; `cobalt` was designed specifically for assessing balance after using `MatchIt` and other packages.
Standardized mean differences are not always presented in absolute value. In `MatchIt`, they are not. A negative SMD just means the control group has a larger mean than the treated group, and a positive SMD just means the treated group has a larger mean than the control group. The cutoffs refer to the absolute value of the SMD; an SMD of -.15 reflects the same degree of imbalance as an SMD of .15.