Standardized Mean Difference – Formula for Categorical Variables in Cobalt Package

causality, propensity-scores, r, weighted-mean, weighted-data

I am having trouble understanding the formula that the cobalt package uses to calculate the standardized mean difference (SMD) for binary variables.

data("lalonde", package="cobalt")
library(WeightIt)
library(cobalt)
W.out <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                  data = lalonde, estimand = "ATE", method = "ps")

table <- bal.tab(W.out, stats = c("m", "v"), thresholds = c(m = .10),
                 disp = c("means", "sds"), s.d.denom = "pooled",
                 un = TRUE, binary = "std")

For the unweighted population, I can reproduce cobalt's results for binary variables with this formula (Austin, 2009):

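# x: proportion in group 1, y: proportion in group 2
# (e.g., race_black: 0.8432 and 0.2028)
smd_bin <- function(x, y) {
  z <- x * (1 - x)  # variance of the binary variable in group 1
  t <- y * (1 - y)  # variance of the binary variable in group 2
  l <- (z + t) / 2  # average of the two group variances

  return((x - y) / sqrt(l))
}

smd_bin(0.843243243243243, 0.202797202797203)
[1] 1.670826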

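This is the R implementation of the following formula, where $\hat{p}_1$ and $\hat{p}_2$ are the proportions in groups 1 and 2:

$$d = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1) + \hat{p}_2(1 - \hat{p}_2)}{2}}}$$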

However, when I calculate the SMD for the WEIGHTED population, I don't obtain the same results as cobalt.
To calculate the SMD in the WEIGHTED population, I apply the same formula as before but with the weighted proportions (Austin, 2011), thus:

smd_bin(0.447822556953102,0.397896376833797)
[1] 0.1011917

But the cobalt package calculates it as: 0.130249813461064

Two questions:

  • What formula does the cobalt package use to calculate the weighted SMD for categorical variables?
  • If it doesn't calculate a weighted SMD, how can I calculate a weighted SMD for categorical variables?

Best Answer

As I mentioned in my previous answer, cobalt always uses the unweighted variance in the denominator. That means you can't just supply the weighted proportions to your function and expect to get the right results: the weighted proportions go in the numerator, but you still need the unweighted proportions to compute the denominator of the SMD.

I have explained this choice here and here (so I won't do it again here). This is the best practice recommended in the literature and is described in the bal.tab() documentation.
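Written out for a binary variable with `s.d.denom = "pooled"`, the calculation described above can be sketched as follows. The notation is my own shorthand rather than anything taken from the cobalt documentation: $\hat{p}_1$ and $\hat{p}_2$ are the unweighted proportions in the two groups, and $\hat{p}_1^{w}$ and $\hat{p}_2^{w}$ are the weighted proportions.

```latex
\[
\mathrm{SMD}_{w}
  = \frac{\hat{p}_1^{w} - \hat{p}_2^{w}}
         {\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1) + \hat{p}_2(1 - \hat{p}_2)}{2}}}
\]
```

Only the numerator changes when weights are applied; the denominator is the same one used for the unweighted SMD.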

So, we can write a new function that takes in the unweighted means and the weighted means.

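# x, y:   unweighted proportions in groups 1 and 2 (used for the denominator)
# wx, wy: weighted proportions in groups 1 and 2 (used for the numerator);
#         they default to the unweighted proportions
smd_bin2 <- function(x, y, wx = x, wy = y){
  z <- x * (1 - x)
  t <- y * (1 - y)

  # The denominator is built from the UNWEIGHTED variances only
  k <- z + t
  l <- k/2

  return((wx - wy)/sqrt(l))
}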

For the unweighted SMD, we supply the unweighted means, as you have done:

> smd_bin2(0.843243243243243, 0.202797202797203)
[1] 1.670826

For the weighted SMD, we additionally supply the weighted means:

> smd_bin2(0.843243243243243,0.202797202797203,
           0.447822556953102,0.397896376833797)
[1] 0.1302498
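
If you would rather start from the raw data than from the printed proportions, here is a minimal base-R sketch of the same calculation. The toy vectors and the function name `smd_bin_raw` are made up for illustration, and this is not cobalt's internal code; it only assumes a 0/1 covariate, a 0/1 treatment indicator, and a vector of weights.

```r
# Hypothetical toy data: binary covariate, treatment indicator, weights
x     <- c(1, 1, 0, 1, 0, 0, 1, 0)
treat <- c(1, 1, 1, 1, 0, 0, 0, 0)
w     <- c(0.5, 1.5, 2.0, 1.0, 1.0, 1.0, 2.0, 1.0)

# Hypothetical helper: weighted difference in proportions divided by the
# square root of the averaged UNWEIGHTED group variances
smd_bin_raw <- function(x, treat, w = rep(1, length(x))) {
  # Unweighted proportions, used only in the denominator
  p1 <- mean(x[treat == 1])
  p0 <- mean(x[treat == 0])

  # Weighted proportions, used only in the numerator
  wp1 <- weighted.mean(x[treat == 1], w[treat == 1])
  wp0 <- weighted.mean(x[treat == 0], w[treat == 0])

  (wp1 - wp0) / sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / 2)
}

smd_bin_raw(x, treat)     # unweighted SMD: 1.154701
smd_bin_raw(x, treat, w)  # weighted SMD:   0.4618802
```

With the objects from the question, a call along the lines of `smd_bin_raw(as.numeric(lalonde$race == "black"), lalonde$treat, W.out$weights)` should give the weighted value for race_black, though I have not shown that output here.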