The goal of IPTW is to achieve balance. If balance is not achieved by your IPTW specification, you can try to respecify the model, or you can use regression in the weighted sample with the imbalanced covariates included to adjust for confounding by those covariates. This is not necessarily the best way to proceed, though. Failing to balance a covariate with the weights means that you are placing the entire burden of adjusting for that covariate onto the outcome regression model. If that model is wrong (and it almost certainly is), confounding will remain. The point of balancing is to ensure that the confounding that remains after covariate adjustment by an incorrect model is as small as possible. This is the thesis of Ho, Imai, King, and Stuart (2007).
It doesn't make much sense to remove a covariate from a propensity score model. If the model fails to balance a covariate, you should add that covariate to the model in multiple different ways (e.g., squared terms, log terms, interactions, subclasses) to achieve balance, not drop it because the model that includes it is performing poorly. Surely a model without the covariate will balance it even worse.
Ideally, you should combine IPTW with an outcome regression model so that the remaining imbalance is accounted for by the outcome regression model and the misspecification of the outcome regression model is mitigated by the balance. There are several estimators that combine a propensity score model and an outcome model; these are called "doubly robust" estimators, and outcome regression in an IPTW-weighted sample is one of them, but there are others.
You should also consider using either optimization-based approaches like entropy balancing, which guarantee balance on the covariate means and have good efficiency properties, or machine learning methods like generalized boosted modeling (GBM) or Bayesian additive regression trees (BART), which attempt to flexibly model the propensity score. These are available in the R package WeightIt
(which I developed). There has been so much work done on new, robust methods with excellent statistical properties that one should not be using the simple methods developed 20 years ago.
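As an illustration of the optimization-based idea, here is a minimal numpy sketch of entropy balancing on simulated data. This is not WeightIt's implementation (which is in R and handles far more than mean constraints); the data and variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(0.5, 1.0, size=(200, 2))  # treated-group covariates (simulated)
x0 = rng.normal(0.0, 1.0, size=(300, 2))  # control-group covariates (simulated)
target = x1.mean(axis=0)                  # means the control weights must hit

# Dual of the entropy-balancing problem: control weights are
# w_i proportional to exp((x_i - target) @ lam), with lam chosen so the
# weighted control means equal the treated means. Newton's method on the
# log-sum-exp dual converges quickly for a problem this small.
lam = np.zeros(2)
for _ in range(50):
    z = (x0 - target) @ lam
    w = np.exp(z - z.max())
    w /= w.sum()
    gap = w @ (x0 - target)  # how far the weighted means are from the target
    hess = (x0 - target).T @ ((x0 - target) * w[:, None]) - np.outer(gap, gap)
    lam -= np.linalg.solve(hess, gap)

# Final weights and the means they imply
z = (x0 - target) @ lam
w = np.exp(z - z.max())
w /= w.sum()
balanced_means = w @ x0  # matches `target` after convergence
```

Because the mean constraints are imposed exactly, balance on the covariate means is guaranteed by construction rather than checked after the fact, which is the appeal of these methods.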
I'm going to give you three answers to this question, even though one is enough. In summary, don't use propensity score adjustment. It consistently performs worse than other propensity score methods and adds few, if any, benefits over traditional regression.
The first answer is that you can't. Matching is a "design-based" method, meaning the sample is adjusted without reference to the outcome, similar to the design of a randomized trial. Here, you can assess balance in the sample in a straightforward way by comparing the distributions of covariates between the groups in the matched sample, just as you could in the unmatched sample. In contrast, propensity score adjustment is an "analysis-based" method, just like regression adjustment; the sample itself is left intact, and the adjustment occurs through the model. In the same way that you can't assess how well regression adjustment is doing at removing bias due to imbalance, you can't assess how well propensity score adjustment is doing, because as soon as you've fit the model, a treatment effect is estimated and yet the sample is unchanged. Indeed, this is an epistemic weakness of these methods: you can't assess the degree to which confounding due to the measured covariates has been reduced when using regression. Therefore, matching in combination with rigorous balance assessment should be used if your goal is to convince readers that you have truly eliminated substantial bias in the estimate.
The second answer is that Austin (2008) developed a method for assessing balance on covariates when conditioning on the propensity score. The method is as follows:
- Fit a regression model of the covariate on the treatment, the propensity score, and their interaction
- Generate predicted values under treatment and under control for each unit from this model
- Subtract the means of these values
- Divide by the estimated residual standard deviation (if the covariate is continuous) or a standard deviation computed from the predicted probabilities (if the covariate is binary)
This is equivalent to performing g-computation to estimate the effect of the treatment on the covariate, adjusting only for the propensity score. If, conditional on the propensity score, there is no association between the treatment and the covariate, then the covariate would no longer induce confounding bias in the propensity score-adjusted outcome model. Of course, this method only tests for mean differences in the covariate, but using other transformations of the covariate in the models can paint a more holistic picture of balance for the covariate. Though this methodology is intuitive, there is no empirical evidence for its use, and there will always be scenarios where it fails to capture relevant imbalance on the covariates. It also requires a specific correspondence between the outcome model and the models for the covariates, but those models might not be expected to be similar at all (e.g., if they involve different model forms or different assumptions about effect heterogeneity).
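The steps above can be sketched as follows. This is a Python illustration on simulated data, not Austin's code; the data-generating process and names are invented, and the surrounding discussion uses R packages, but the arithmetic is language-agnostic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical simulated data: covariate x, a true propensity score ps
# (which also depends on independent noise), and treatment t.
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-(0.8 * x + rng.normal(size=n))))
t = rng.binomial(1, ps)

# Step 1: regress the covariate on treatment, the PS, and their interaction.
X = np.column_stack([np.ones(n), t, ps, t * ps])
beta, *_ = np.linalg.lstsq(X, x, rcond=None)

# Step 2: predicted covariate values under treatment and under control.
X1 = np.column_stack([np.ones(n), np.ones(n), ps, ps])
X0 = np.column_stack([np.ones(n), np.zeros(n), ps, np.zeros(n)])
diff = (X1 @ beta).mean() - (X0 @ beta).mean()

# Steps 3-4: divide the mean difference by the residual SD (continuous covariate).
resid_sd = (x - X @ beta).std(ddof=4)
d_conditional = diff / resid_sd

# For comparison: the raw (unconditional) standardized difference, which is
# large here because treatment depends on x.
d_raw = (x[t == 1].mean() - x[t == 0].mean()) / x.std(ddof=1)
```

Because treatment is assigned from the true propensity score here, the conditional standardized difference should be near zero even though the raw one is substantial.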
The third answer relies on a recent development: the "implied" weights of linear regression for estimating the effect of a binary treatment, as described by Chattopadhyay and Zubizarreta (2021). Basically, a regression of the outcome on the treatment and covariates is equivalent to a weighted mean difference between the outcomes of the treated and the outcomes of the control, where the weights take a specific form based on the form of the regression model. These weights can include negative values, which makes them different from traditional propensity score weights, but they are conceptually similar otherwise. In theory, you could use these weights to compute weighted balance statistics just as you would with propensity score weights. Your outcome model would, of course, be the regression of the outcome on the treatment and the propensity score. From that model, you could compute the weights and then compute standardized mean differences and other balance measures. All of this assumes that you are fitting a linear regression model for the outcome. As this is a recently developed methodology, its properties and effectiveness have not been empirically examined, but it has a stronger theoretical basis than Austin's method and allows for a more flexible balance assessment.
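As a sketch of the idea (simulated data; a simplified version of the construction, not Chattopadhyay and Zubizarreta's own software), one can recover implied weights using the Frisch-Waugh-Lovell theorem and confirm that they reproduce the OLS treatment coefficient exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical simulated data: three covariates, a treatment that depends
# on them, and an outcome with a treatment effect of 2.
X = rng.normal(size=(n, 3))
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
y = X @ np.array([1.0, 0.5, -0.5]) + 2.0 * t + rng.normal(size=n)

# OLS of the outcome on treatment and covariates.
D = np.column_stack([np.ones(n), t, X])
tau_ols = np.linalg.lstsq(D, y, rcond=None)[0][1]

# Implied weights via Frisch-Waugh-Lovell: residualize the treatment on the
# covariates; the OLS coefficient equals a weighted difference in outcome
# means, with weights proportional to these residuals (possibly negative).
Z = np.column_stack([np.ones(n), X])
e = t - Z @ np.linalg.lstsq(Z, t, rcond=None)[0]
w = np.where(t == 1, e, -e) / (e @ e)

tau_w = w[t == 1] @ y[t == 1] - w[t == 0] @ y[t == 0]  # equals tau_ols
```

A notable property: these implied weights exactly balance the means of the covariates included in the model, which is what makes weighted balance statistics computed from them interpretable.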
What should you do? Don't use propensity score adjustment except as part of a more sophisticated doubly-robust method. If you want to prove to readers that you have eliminated the association between the treatment and covariates in your sample, then use matching or weighting. If you want to rely on the theoretical properties of the propensity score in a robust outcome model, then use a flexible and doubly-robust method like g-computation with the propensity score as one of many covariates or targeted maximum likelihood estimation (TMLE).
Author of `cobalt` here.

`cobalt`, by default when the estimand is the ATT, uses the standard deviation of the variable in the treated group in the denominator of the SMD. It is unclear how `smd` calculates the denominator of the SMD. The documentation is vague, and attempting to replicate the results from the function using the formula in the documentation fails to recover the expected result. I have no idea what specific formula that package is using, and I can't figure it out despite my best efforts.

The formulas `cobalt` uses are transparent: if you compute the SMDs yourself, you will get the exact answer `bal.tab()` reports. For example, for `age`, we have:

The `smd` documentation claims to use the formula $d = \frac{\bar x_1 - \bar x_0}{\sqrt{\frac{s_1^2 + s_0^2}{2}}}$. Calculate that yourself using the values in the table:

It's still not what `smd()` reports. If you set `s.d.denom = "pooled"` in `bal.tab()`, you will find the expected SMD is computed as we computed it manually above (with some difference in the 6th decimal place due to rounding).

You can arbitrarily flip the sign of the SMD; people often report the absolute SMD, so the sign isn't an issue. If you want to know which group has a higher mean, use the means themselves instead of trying to interpret the sign of the SMD.
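For reference, the two denominators in question are easy to compute by hand. Here is a minimal sketch of the arithmetic in Python (the function names are mine; this is not cobalt's code, which computes all of this for you in R):

```python
import numpy as np

def smd_pooled(x, t):
    """SMD with the pooled SD denominator: (m1 - m0) / sqrt((s1^2 + s0^2) / 2)."""
    x1, x0 = x[t == 1], x[t == 0]
    return (x1.mean() - x0.mean()) / np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)

def smd_treated(x, t):
    """SMD with the treated-group SD denominator (cobalt's default for the ATT)."""
    x1, x0 = x[t == 1], x[t == 0]
    return (x1.mean() - x0.mean()) / x1.std(ddof=1)
```

If a package's reported SMD matches neither of these hand calculations, you cannot tell what it is doing, which is exactly the problem described above.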
Don't use the `smd` package for balance assessment, before or after weighting, unless you can accurately explain what it's doing (I can't). Just use `cobalt`. It was specifically designed for balance assessment and uses the best practices in the propensity score analysis literature. A lot of thought went into every decision, and the defaults reflect those considerations. The documentation and formulas are transparent, and the results are what you would expect if you calculated the quantities by hand using best practices.

Following up on the comments:
The formula in `cobalt` for SMDs for the ATE is the same as the formula in the `smd` documentation, which I posted above in my answer. Again, `cobalt` actually uses this formula; compare the result of `bal.tab()` when using `s.d.denom = "pooled"` to the result of the hand calculation I did above. I can't say what formula `smd` actually uses.

For categorical variables, `cobalt` splits them into dummies and then uses the same formula with a slight modification: the variances are computed as $s_a^2=\bar x_a(1-\bar x_a)$. This is exactly the formula recommended by Austin (2009). But please note, as explained in the documentation, that `cobalt` reports the unstandardized mean differences for binary variables by default. To request SMDs, set `binary = "std"` in the call to `bal.tab()`. See also this answer, in which I also discuss the differences between `smd` and `cobalt`. `smd` uses a particular formula for computing a single "SMD" for categorical variables, which `cobalt` doesn't do (`cobalt` computes a balance statistic for each category, not the variable as a whole). I explain why I don't like the statistic `smd` calculates in this answer.

Please read the `cobalt` documentation closely, as all of this is explained. Every choice and my motivation for it is explained either in the main vignette or on the documentation page for the function of interest.

Here is the result we get after weighting (a simplified version of the table you requested):
Note that I requested standardized mean differences for binary variables. Let's look at `married` to see how the SMD is calculated. Austin's formula is $$ d = \frac{\bar x_1 - \bar x_0}{\sqrt{\frac{s^2_1+s^2_0}{2}}} $$ where, for a binary variable, $s^2_a=\bar x_a(1-\bar x_a)$. Note that this is requested when we set `s.d.denom = "pooled"`, which is not the default for the ATT. For the ATT, we use $\sqrt{s^2_1}$ in the denominator, which can be requested manually by setting `s.d.denom = "treated"`.

Looking at the unweighted statistics, we can calculate this by hand. This is exactly the SMD for `married` in the unweighted sample. We can calculate the SMD for `married` in the weighted sample using the estimated weights. Remember that the denominator doesn't change; we always use the unweighted denominator. So all we need to do is replace the means with the weighted means. That is exactly the value reported under `Diff.Adj` for `married`.

If you are seeing discrepancies, maybe you aren't setting the options correctly. The default for binary variables is the unstandardized difference in means; to get the standardized difference, we need to set `binary = "std"`. The default denominator of the SMD when the estimand is the ATT is $\sqrt{s^2_1}$; to use the pooled standard deviation, which is what Austin uses, you need to set `s.d.denom = "pooled"`. Also remember that we always use the unweighted denominator in the SMD.

All of this is explained in the documentation for `bal.tab()`. The documentation for using `bal.tab()` with `weightit` objects specifically explains how `s.d.denom` is set by default. The documentation for `col_w_smd()` (which is the underlying function that calculates the SMD) explains what each `s.d.denom` option means.

If you're still confused about how a specific number is computed, let me know and I'll explain.
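To make the recipe concrete, here is a sketch of the arithmetic in Python. The helper name is mine and this is not cobalt's code; it implements Austin's pooled formula for a binary covariate with the denominator held fixed at its unweighted value, as described above.

```python
import numpy as np

def smd_binary(x, t, w=None):
    """SMD for a binary covariate (hypothetical helper, not cobalt's code).

    Uses Austin's pooled formula (p1 - p0) / sqrt((p1(1-p1) + p0(1-p0)) / 2).
    The denominator always uses the *unweighted* proportions; only the
    numerator switches to weighted proportions when weights are supplied.
    """
    if w is None:
        w = np.ones_like(x, dtype=float)
    # Unweighted proportions, used only for the denominator
    p1u, p0u = x[t == 1].mean(), x[t == 0].mean()
    denom = np.sqrt((p1u * (1 - p1u) + p0u * (1 - p0u)) / 2)
    # (Possibly weighted) proportions for the numerator
    p1 = np.average(x[t == 1], weights=w[t == 1])
    p0 = np.average(x[t == 0], weights=w[t == 0])
    return (p1 - p0) / denom
```

With weights that equalize the group proportions, the numerator goes to zero while the denominator stays at its unweighted value, so before-and-after SMDs remain on the same scale.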