Why can MatchIt calculate a mean on a factor column?

categorical-data, matching

I assume that it is not possible to calculate a mean() on a factor() column with three levels.

But R's MatchIt package seems to be able to do that. See this summary output, which reports a mean for each of the three levels of the factor column race:

Summary of Balance for Matched Data:
           Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
distance          0.5661        0.2111          1.6809     0.7758    0.3466   0.6189          1.6821
age              25.8162       28.9541         -0.4386     0.4006    0.1051   0.2027          1.4887
raceblack         0.8432        0.2351          1.6726          .    0.6081   0.6081          1.6726
racehispan        0.0595        0.1649         -0.4457          .    0.1054   0.1054          0.8114
racewhite         0.0973        0.6000         -1.6962          .    0.5027   0.5027          1.7327
married           0.1892        0.4351         -0.6280          .    0.2459   0.2459          0.8350

This is the minimal working example to reproduce this:

library("MatchIt")
data("lalonde")

# simplify
lalonde = lalonde[,c("treat", "age", "race", "married")]

# 2:1 matching
m.out <- matchit(
    treat ~ age + race + married,
    data = lalonde,
    method = "nearest",
    distance = "glm",
    ratio = 2
)
summary(m.out)

Again, I am sure that the race column is a factor with only three levels:

> table(lalonde$race)

 black hispan  white
   243     72    299

> class(lalonde$race)
[1] "factor"

I am not able to calculate a mean on that column myself:

> mean(lalonde$race)
[1] NA
Warning message:
In mean.default(lalonde$race) :
  argument is not numeric or logical: returning NA

Best Answer

For categorical variables, MatchIt reports the proportion of observations in each category, separately for the treated and the control units. For binary variables, it reports the proportion of 1s. To verify, check that the raceblack, racehispan, and racewhite values sum to 1 in both the Means Treated and the Means Control columns (for example, 0.8432 + 0.0595 + 0.0973 = 1 for the treated).

If a random variable $X$ takes only the two values {0, 1}, then $\operatorname{E}\{X\} = \operatorname{Pr}\{X=1\}$. So MatchIt computes the means of one-hot encoded (dummy) indicators, one for each level of a categorical variable.
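The same idea can be reproduced in base R without MatchIt. The following is only a minimal sketch on made-up toy data (the data frame `toy` and its values are hypothetical, not the lalonde data), and it is not meant to show how MatchIt is implemented internally. It one-hot encodes the factor with `model.matrix()` and compares the per-group means of the indicators with the per-group level proportions.

```r
# Toy data (hypothetical values, not the lalonde data set)
toy <- data.frame(
  treat = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  race  = factor(c("black", "black", "black", "white",
                   "white", "white", "white", "hispan", "black", "white"))
)

# One-hot encode race: one 0/1 indicator column per level, no intercept.
# The columns are named raceblack, racehispan, racewhite.
dummies <- model.matrix(~ race - 1, data = toy)

# Mean of each indicator within the control (0) and treated (1) groups
group_means <- sapply(split(as.data.frame(dummies), toy$treat), colMeans)

# Proportion of each level within each group, computed directly
group_props <- prop.table(table(toy$race, toy$treat), margin = 2)

print(group_means)
print(group_props)

# Within each group, the indicator means sum to 1
print(colSums(group_means))
```

Calling mean() on the factor itself still returns NA with a warning; the mean only becomes meaningful once each level is turned into its own 0/1 indicator.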