On why you and MatchBalance get different values for the standardized mean difference (SMD): first, MatchBalance multiplies the SMD by 100, so the actual SMD on the scale of the variable is .11317. That is still much larger than what you get from TableOne and your own calculation, which is because of how you created match_data and computed the SMD with it.
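Since the question's exact code isn't shown, the snippets below assume a minimal stand-in setup on the Matching package's lalonde data; the propensity score model here is a placeholder, not your actual model:

library(Matching)
data(lalonde)
# Placeholder propensity score model; substitute your own covariates
ps <- glm(treat ~ age + educ + re74 + re75, data = lalonde, family = binomial)$fitted.values
rr <- Match(Tr = lalonde$treat, X = ps, estimand = "ATT")
# MatchBalance prints the standardized mean difference multiplied by 100
MatchBalance(treat ~ age, data = lalonde, match.out = rr, nboots = 0)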
You will notice that match_data has more rows than lalonde, even though in matching you discarded units. That is because the structure of index.treated and index.control is not what you expect when you match with ties. Each time a unit is paired, that pair gets its own entry in those vectors. With ties, one treated unit can be matched to many control units (as many as share the same propensity score). Each control unit matched to a treated unit adds an entry to index.treated for that treated unit, so a treated unit that is matched with 4 tied control units will have 4 entries in index.treated. That would give it 4 times the weight of another treated unit in your calculation, which is clearly inappropriate: each treated unit should only be counted once, and the contribution of each control unit should correspond to how many ties it has. To address this, Match returns a vector of weights in the weights component, one for each pair, that represents how much that pair should contribute.
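You can see this pair structure directly in the Match output (using rr from the sketch above):

# Each (treated, control) pair gets its own entry; a treated unit tied with
# k controls appears k times in index.treated
head(data.frame(treated = rr$index.treated,
                control = rr$index.control,
                weight  = rr$weights))
table(table(rr$index.treated))  # how many pairs each treated unit belongs to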
The way MatchBalance computes the SMD is by taking the weighted difference in means and dividing by the weighted standard deviation in the treated group. Applying this formula below (with rr the Match output), we do indeed get MatchBalance's answer:
# Stack the treated and control rows of each pair (rr is the Match output)
matched_data <- lalonde[unlist(rr[c("index.treated","index.control")]),]
# One pair weight for each treated row and each control row
w1 <- c(rr$weights, rr$weights)
# Define weighted mean and weighted SD functions
w.m <- function(x, w) sum(x*w)/sum(w)
w.sd <- function(x, w) sqrt(sum(((x - w.m(x,w))^2)*w)/(sum(w)-1))
# Weighted mean difference divided by the weighted SD in the treated group
with(matched_data, (w.m(age[treat==1], w1[treat==1]) - w.m(age[treat==0], w1[treat==0]))/w.sd(age[treat==1], w1[treat==1]))
#> [1] 0.1131677
If, instead of dealing with this oddly sized dataset, you want to work with your original dataset and a set of matching weights, in which unmatched units are weighted 0 and matched units are weighted according to how many matches they are part of, you can use the get.w function in cobalt to extract matching weights from the Match object. These are not the same as the weights component of the Match object; the weights returned by get.w have one entry for each unit in the original dataset. Using the same formula as above with these new weights, you will see the answer is the same:
# One matching weight per unit in the original dataset (0 for unmatched units)
w2 <- cobalt::get.w(rr, treat = lalonde$treat)
with(lalonde, (w.m(age[treat==1], w2[treat==1]) - w.m(age[treat==0], w2[treat==0]))/w.sd(age[treat==1], w2[treat==1]))
#> [1] 0.1131677
Note that MatchBalance uses the weighted standard deviation of the treated group as the standardization factor (SF); I believe this is inappropriate, so when you run bal.tab in cobalt on the Match output you will not get the same results: the unweighted standard deviation of the treated group is used instead.
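As a hand-rolled check of that difference (not cobalt's internals), here is the same weighted mean difference divided by the unweighted standard deviation of age in the treated group, reusing w.m and w2 from above:

with(lalonde, (w.m(age[treat==1], w2[treat==1]) - w.m(age[treat==0], w2[treat==0]))/sd(age[treat==1]))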
Finally, if you turn off ties by setting ties = FALSE in the call to Match, then your formula does work if you modify the standard deviation to be that of the matched treated group, because all the weights in the Match object are then equal to 1.
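A quick sketch of that case, reusing the placeholder ps from above (with ties = FALSE, ties are broken at random, so set a seed for reproducibility):

set.seed(123)
rr2 <- Match(Tr = lalonde$treat, X = ps, estimand = "ATT", ties = FALSE)
md2 <- lalonde[unlist(rr2[c("index.treated", "index.control")]), ]
# All pair weights are 1, so plain means and the matched treated-group SD work
with(md2, (mean(age[treat==1]) - mean(age[treat==0]))/sd(age[treat==1]))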
Check out my R package cobalt, which was specifically designed for assessing balance after propensity score matching; I wrote it because different packages use different formulas for computing the SMD. cobalt provides several options for computing the SMD; it is not a trivial problem. Matching, MatchIt, twang, CBPS, and other packages all use different standards, so I wanted to unify them. You can read more about the motivations for cobalt in its vignette.
The only thing that differs among methods of computing the SMD is the denominator, the standardization factor (SF). There are a few desiderata for an SF that have been implied in the literature:
- It should be the same before and after matching, to ensure that differences before and after matching are due to changes in the mean difference rather than to changes in the SF
- It should reflect the target population of interest
Rubin's early works recommend computing the SF as $\sqrt{\frac{s_1^2 + s_2^2}{2}}$. The What Works Clearinghouse recommends using the small-sample corrected Hedges' $g$, which has its own funky formula (see page 15 of the WWC Procedures Handbook). You computed the SF simply as the standard deviation of the variable in the combined matched sample. There are many other formulas, which can be controlled in cobalt using the s.d.denom argument, described in the documentation for the function col_w_smd, which computes (weighted) SMDs.
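For reference, hand-rolled versions of the most common SFs for a covariate x and binary treatment t look like this (illustrative only, not cobalt's implementation):

sf_pooled  <- function(x, t) sqrt((var(x[t==1]) + var(x[t==0]))/2)  # Rubin's formula
sf_treated <- function(x, t) sd(x[t==1])  # focal-group SD, as used for the ATT
sf_all     <- function(x, t) sd(x)        # SD in the combined sample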
The standards I use in cobalt
are the following:
- The SF is always computed in the unadjusted (i.e., pre-matched or unweighted) sample (except in a few cases)
- When the estimand is the ATT or ATC, the SF is the standard deviation of the variable in the focal group (i.e., the treated or control group, respectively)
- When the estimand is the ATE, the SF is computed using Rubin's formula above
The user has the option of setting s.d.denom to a few other values, including "hedges" for the small-sample corrected Hedges' $g$, "all" for the standard deviation of the variable in the combined unadjusted sample, or "weighted" for the standard deviation in the combined adjusted sample, which is what you computed.
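For example, with the w2 weights extracted earlier, you can see how the choice of s.d.denom changes the SMD for age (a sketch; check the col_w_smd documentation for the full set of accepted values):

cobalt::col_w_smd(lalonde["age"], treat = lalonde$treat, weights = w2, s.d.denom = "treated")
cobalt::col_w_smd(lalonde["age"], treat = lalonde$treat, weights = w2, s.d.denom = "pooled")
cobalt::col_w_smd(lalonde["age"], treat = lalonde$treat, weights = w2, s.d.denom = "weighted")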
There are a few unusual cases. Typically when matching one wants the ATT, but if you discard treated units through a common support restriction or a caliper, the target population becomes ambiguous; in these cases, cobalt treats the estimand as if it were the ATE. When using propensity score weights to estimate the ATO or ATM, the target population is actually defined by the weights, so the SF will be the weighted standard deviation, and the same SF will be used before and after weighting to ensure it is constant. There may be a few other quirks here and there that are described in the documentation.
What should you do? It doesn't matter much. The SMD is just a heuristic, and its exact value isn't as important as how generally close to zero it is. The different ways of computing the SF will not meaningfully change its value in most cases. My advice is to use cobalt's defaults, or to choose the formula you like and supply it to cobalt's functions. Be consistent when reporting your results, and it is best to include the formula you used in your report.
Best Answer
For categorical variables, MatchIt reports the proportion of observations in each category, separately for the treated and the controls. For binary variables, it reports the proportion of 1s. To verify, check that raceblack + racehispan + racewhite sum to 1 in the Treated and Control columns. If a random variable $X$ takes the two values $\{0, 1\}$, then $\operatorname{E}\{X\} = \operatorname{Pr}\{X=1\}$. So MatchIt computes the means of one-hot encoded (dummy) indicators, one for each level of a categorical variable.
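A quick illustration (note that MatchIt's lalonde, with a factor race column, differs from the Matching package's version used above):

# The mean of a 0/1 indicator is its proportion of 1s
x <- c(1, 0, 1, 1)
mean(x)
#> [1] 0.75

# One-hot (dummy) columns of a factor sum to 1 within each row
data("lalonde", package = "MatchIt")
dummies <- model.matrix(~ race - 1, data = lalonde)  # raceblack, racehispan, racewhite
all(rowSums(dummies) == 1)
#> [1] TRUE
colMeans(dummies[lalonde$treat == 1, ])  # per-category proportions among the treated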