Standardized Mean Difference – Discrepancy Between Standardized Mean Difference in Cobalt and SMD Packages

propensity-scoresstandardized-mean-differenceweighted-dataweights

Let's suppose I have weighted my population using WeightIt

data("lalonde", package = "cobalt")
library(cobalt)
library(WeightIt)

W.out <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                  data = lalonde, estimand = "ATT", method = "ps")
W.out

bal.tab(W.out, stats = c("m", "v"), thresholds = c(m = .05), disp=c("means", "sds"), un=TRUE)

The standardized mean differences provided by cobalt before weighting in the unbalanced population are the following:

enter image description here

However, when I use the package smd, the standardized mean differences of the unweighted population appear to be different, for example:

library(smd)
smd(lalonde$age, lalonde$treat)
smd(lalonde$educ, lalonde$treat)

enter image description here

Some questions:

  • Why are the numbers completely different in the unweighted population between cobalt package and smd?? It is not that the numbers are just different but they are COMPLETELY different in a meaningful way, like in a case SD<10% if I use one package and SD>10% if I use another.. What is the formula used in the unbalanced population for SMD in the cobalt package? Even if I put in the cobalt code s.d.denom="pooled" the results from the SMD package still differ.
  • Since SMD is basically a subtraction, if I want to have SMD of treated – non treated, I can just remove the minus sign from smd package results, right?
  • Is there a way to calculate the SMD in the weighted population using cobalt with the same formula as SMD package calculates it? If I put in the cobalt code s.d.denom="pooled" the results from the SMD package still differ.
  • Is it reasonable to use smd package BEFORE weighting and cobalt package for SMD after weighting?

Best Answer

Author of cobalt here. cobalt, by default when the estimand is the ATT, uses the standard deviation of the variable in the treated group in the denominator of the SMD. It is unclear how smd calculates the denominator of the SMD. The documentation is vague, and attempting to replicate the results from the function using the formula in the documentation fails at recovering the expected result. I have no idea what specific formula that package is using, and I can't figure it out despite my best efforts.

The formulas cobalt uses are transparent: if you compute the SMDs yourself, you will get the exact answer bal.tab() reports. For example, for age, we have

> (25.8162 - 28.0303) / 7.1550
[1] -0.3094479

The smd documentation claims to use the formula $d = \frac{\bar x_1 - \bar x_0}{\sqrt{\frac{s_1^2 + s_0^2}{2}}}$. Calculate that yourself using the values in the table:

> (25.8162 - 28.0303) / sqrt((7.1550^2 + 10.7867^2)/2)
[1] -0.2419045

It's still not what smd() reports. If you set s.d.denom = "pooled" in bal.tab(), you will find the expected SMD is computed as we computed it manually above (with some difference in the 6th decimal value due to rounding).

You can arbitrarily flip the signs of the SMD; often people report the absolute SMD so the sign isn't an issue. If you want to know which group has a higher mean, use the means themselves instead of trying to interpret the sign of the SMD.

Don't use the smd package for balance assessment, before or after weighting, unless you can accurately explain what it's doing (I can't). Just use cobalt. It was specifically designed for balance assessment and uses the best practices in the propensity score analysis literature. A lot of thought went into every decision and the defaults reflect those considerations. The documentation and formulas are transparent and the results are what you would expect if you calculated the quantities by hand using best practices.


Following up on the comments:

The formula in cobalt for SMDs for the ATE is the same as the formula in the smd documentation, which I posted above in my answer. Again, cobalt actually uses this formula; compare the result of bal.tab() when using s.d.denom = "pooled" to the result of the hand calculation I did above. I can't say what formula smd actually uses.

For categorical variables, cobalt splits them into dummies and then uses the same formula with a slight modification, which is that the variances are computed as $s_a^2=\bar x_a(1-\bar x_a)$. This is exactly the formula recommended by Austin (2009).

But please note as explained in the documentation that cobalt reports the unstandardized mean differences for binary variables by default. To request SMDs, set binary = "std" in the call to bal.tab(). See also this answer, in which I also discuss the differences between smd and cobalt. smd uses a particular formula for computing a single "SMD" for categorical variables, which cobalt doesn't do (cobalt computes a balance statistic for each category, not the variable as a whole). I explain why I don't like the statistic smd calculates in this answer.

Please read the cobalt documentation closely, as all of this is explained. Every choice and my motivation for it is explained either in the main vignette or on the documentation page for the function of interest.


Here is the result we get after weighting (a simplified version of the table you requested:

> bal.tab(W.out, disp = c("means", "sds"), un = TRUE,
          binary = "std", s.d.denom = "pooled")
Balance Measures
                Type    M.0.Un   SD.0.Un    M.1.Un   SD.1.Un Diff.Un   M.0.Adj  SD.0.Adj   M.1.Adj  SD.1.Adj Diff.Adj
prop.score  Distance    0.1822    0.2295    0.5774    0.2203  1.7569    0.5820    0.2168    0.5774    0.2203  -0.0201
age          Contin.   28.0303   10.7867   25.8162    7.1550 -0.2419   24.9658   10.5754   25.8162    7.1550   0.0929
educ         Contin.   10.2354    2.8552   10.3459    2.0107  0.0448   10.4031    2.4681   10.3459    2.0107  -0.0231
race_black    Binary    0.2028    0.4021    0.8432    0.3636  1.6708    0.8455    0.3614    0.8432    0.3636  -0.0058
race_hispan   Binary    0.1422    0.3492    0.0595    0.2365 -0.2774    0.0593    0.2362    0.0595    0.2365   0.0006
race_white    Binary    0.6550    0.4754    0.0973    0.2964 -1.4080    0.0952    0.2935    0.0973    0.2964   0.0052
married       Binary    0.5128    0.4998    0.1892    0.3917 -0.7208    0.1706    0.3761    0.1892    0.3917   0.0414
nodegree      Binary    0.5967    0.4906    0.7081    0.4546  0.2355    0.6897    0.4626    0.7081    0.4546   0.0390
re74         Contin. 5619.2365 6788.7508 2095.5737 4886.6204 -0.5958 2106.0448 4252.2469 2095.5737 4886.6204  -0.0018
re75         Contin. 2466.4844 3291.9962 1532.0553 3219.2509 -0.2870 1496.5412 2726.7838 1532.0553 3219.2509   0.0109

Effective sample sizes
           Control Treated
Unadjusted  429.       185
Adjusted     99.82     185

Note that I requested standardized mean differences for binary variables. Let's look at married to see how the SMD is calculated. Austin's formula is $$ d = \frac{\bar x_1 - \bar x_0}{\sqrt{\frac{s^2_1+s^2_0}{2}}} $$ where, for a binary variable, $s^2_a=\bar x_a(1-\bar x_a)$. Note that this is requested when we set s.d.denom = "pooled", which is not the default for the ATT. For the ATT, we use $\sqrt{s^2_1}$ in the denominator, which can be requested manually by setting s.d.denom = "treated".

Looking at the unweighted statistics, we can calculate this by hand.

> x <- lalonde$married
> t <- lalonde$treat
> 
> m_1 <- mean(x[t==1])
> m_0 <- mean(x[t==0])
> 
> s2_1 <- m_1 * (1 - m_1)
> s2_0 <- m_0 * (1 - m_0)
> 
> (m_1 - m_0) / sqrt((s2_1 + s2_0) / 2)
[1] -0.7207554

This is exactly the SMD for married in the unweighted sample. We can calculate the SMD for married in the weighted sample using the estimated weights. Remember that the denominator doesn't change; we always use the unweighted denominator. So all we need to do is replace the means with the weighted means.

> w <- W.out$weights
> m_1 <- weighted.mean(x[t==1], w[t==1])
> m_0 <- weighted.mean(x[t==0], w[t==0])
> 
> (m_1 - m_0) / sqrt((s2_1 + s2_0) / 2)
[1] 0.04144392

That is exactly the value reported under Diff.Adj for married.

If you are seeing discrepancies, maybe you aren't setting the values correctly. The default for binary variables is the unstandardized difference in means; to get the standardized difference, we need to set binary = "std". The default formula in the denominator of the SMD when the estimand is the ATT is $\sqrt{s^2_1}$; to use the pooled standard deviation, which is what Austin uses, you need to set s.d.denom = "pooled". Also remember that we always use the unweighted denominator in the SMD.

All of this is explained in the documentation for bal.tab(). The documentation for using bal.tab() with weightit objects specifically explains how s.d.denom is set by default. The documentation for col_w_smd() (which is the underlying function that calculates the SMD) explains what each s.d.denom means.

If you're still confused about how a specific number is computed, let me know and I'll explain.