Solved – Categorical variable as control variable in MatchIt

Tags: categorical-data, matching, propensity-scores, r

I'm fairly new to R and am trying to run propensity score matching using the MatchIt package.
Some of my control variables are continuous, but some are categorical. For example, I have a "currency" variable that contains multiple currencies. I included all the variables as controls when calling matchit(), but I'm not sure that was right…
The summary of the match shows the following differences for the currency variable:
             Means Treated  Means Control  SD Control  Mean Diff  eQQ Med  eQQ Mean  eQQ Max
CurrencyCAD         0.0000         0.0000      0.0000     0.0000   0.0000    0.0000   0.0000
CurrencyEUR         0.0000         0.0000      0.0000     0.0000   0.0000    0.0000   0.0000
CurrencyGBP         0.1509         0.1509      0.3614     0.0000   0.0000    0.0000   0.0000
CurrencyNZD         0.0000         0.0000      0.0000     0.0000   0.0000    0.0000   0.0000
CurrencyUSD         0.8302         0.8302      0.3791     0.0000   0.0000    0.0000   0.0000
For another categorical variable ("Hobby"), it showed me values other than 0 or 100. What does this mean?

                  Means Treated  Means Control  SD Control  Mean Diff  eQQ Med  eQQ Mean  eQQ Max
HobbyPhotography         0.0000         0.0000      0.0000     0.0000   0.0000    0.0000   0.0000
HobbyDance               0.0189         0.0000      0.0000     0.0189   0.0000    0.0189   1.0000
HobbyTechnology          0.0189         0.0377      0.1924    -0.0189   0.0000    0.0189   1.0000

In addition, I have some binary 0/1 variables (e.g. HadParticipated) which I included in the match controls. Is that right?
The difference I got for one of them is as follows:
HadParticipated 0.8868 0.8679 0.3418 0.0189 0.0000 0.0189 1.0000

I'm not sure what the best way is to include those variables as controls in the match. Any help? Thanks!

Best Answer

These numbers may make sense given your dataset. The 0's for the treatment and control groups' means for CurrencyCAD, CurrencyEUR, CurrencyNZD, and HobbyPhotography should just mean that those levels are not present in the matched cohort.

From your post, I'm guessing matchit is creating the dummy coded level variables for you (like you did manually for HadParticipated). Is there a level of Currency and a level of Hobby that are not shown in your post? The means should be the proportion in that category for the matched cohort in a given arm, so those means need to sum to one over all the categories for a variable.
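
To make the "means sum to one" check concrete, here is a small base-R sketch on made-up data (the `currency` vector is hypothetical, not the asker's dataset). It builds one 0/1 indicator column per level, much as the summary table appears to, and shows that each column mean is the proportion of rows in that level. Applied to the table in the question, the Currency means shown add up to 0.1509 + 0.8302 = 0.9811, so roughly 0.0189 would have to belong to a level that is not listed.

```r
# Hypothetical toy data: a factor with three levels
currency <- factor(c("USD", "USD", "GBP", "USD", "AUD", "USD"))

# One 0/1 indicator column per level
# ("- 1" drops the intercept so that no level is omitted)
dummies <- model.matrix(~ currency - 1)

# The mean of a 0/1 indicator is the proportion of rows in that level
props <- colMeans(dummies)
print(round(props, 4))

# Proportions across all levels of one variable add up to 1
total <- sum(props)
print(total)
```

If the means you see for one variable add up to less than 1, the remainder most likely sits in a level that was dropped from the display or that you did not paste.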

Related Solutions

Solved – How to interpret output of Match() function in R (for propensity score matching)

So the output is

Estimate... -0.349,
AI SE... 0.124,
T-stat... -2.827,
p.val... 0.005

You did the matching presumably because you'd like to interpret the difference in outcome for treatment and control as a causal effect, i.e. as the change in the dependent variable caused by treatment, and you don't necessarily trust a big regression with controls to work out for you (though you do trust that you've got all the causes of treatment assignment bundled into the propensity score model).

In your case I guess that the dependent variable is a probability. If so, the matching analysis says that this probability is 0.35 lower due to treatment: an absolute difference of 0.35, because you're computing a difference. This difference is computed after your data set has been matched, pruned, etc., to balance the covariates across treatment and control cases as well as possible. You should check that balance using other functions in the package before trusting the summary output.
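
As a rough illustration of what a balance check measures, here is a hand-rolled standardized mean difference in base R. The data and the names `smd`, `treat`, and `age` are hypothetical and only serve the example; in the Matching package itself, `MatchBalance()` is the function intended for this job and reports considerably more than this sketch does.

```r
# Standardized mean difference for one covariate:
# difference in group means divided by the treated-group standard deviation
smd <- function(x, treat) {
  (mean(x[treat == 1]) - mean(x[treat == 0])) / sd(x[treat == 1])
}

# Hypothetical toy data: 'age' is imbalanced between the groups by construction
set.seed(1)
treat <- rep(c(1, 0), each = 50)
age   <- c(rnorm(50, mean = 40, sd = 5), rnorm(50, mean = 35, sd = 5))

# A value far from 0 signals imbalance on this covariate
smd_age <- smd(age, treat)
print(smd_age)
```

You would compute this on the matched sample for every covariate and expect values close to 0 if the matching did its job.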

You have a lot of control over what 'good matching' means, though you've gone with the defaults, which are, I believe, to estimate the average treatment effect on the treated (ATT), not to use calipers, etc. You can see the defaults on the relevant help page. So that's the Estimate here.

The AI SE is a matching-corrected standard error due to Abadie and Imbens (hence the name AI). The t-stat and p-value are interpretable as usual, though they are computed with that corrected standard error. You can find the details of AI standard errors in Abadie and Imbens's original paper.
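
The relationship between the four numbers can be reproduced by hand. The sketch below uses the rounded values from the output above and assumes the p-value is two-sided and taken from the standard normal distribution; that reference distribution is my assumption, so check the package documentation if the exact convention matters to you.

```r
# Values copied from the rounded output above
estimate <- -0.349
ai_se    <- 0.124

# Test statistic: estimate divided by its standard error
t_stat <- estimate / ai_se

# Two-sided p-value from the standard normal distribution
# (assumed reference distribution, see the note above)
p_val <- 2 * pnorm(-abs(t_stat))

print(round(t_stat, 3))
print(round(p_val, 3))
```

The statistic comes out at about -2.81 rather than exactly -2.827 because the inputs here are already rounded to three decimals.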

R – Matching with Multiple Treatments

I recommend taking a look at Lopez & Gutman (2017), who clearly describe the issues at hand and the methods used to solve them.

Based on your description, it sounds like you want the average treatment effect in the control group (ATC) for several treatments. For each treatment level, this answers the question, "For those who received the control, what would their improvement have been had they received treatment A?" We can, in a straightforward way, ask this about all of our treatment groups.

Note that this differs from the usual estimand in matching, the average treatment effect in the treated (ATT), which answers the question, "For those who received treatment, what would their decline have been had they received the control?" This question establishes that for those who received treatment, treatment was effective. The question the ATC answers is about what would happen if we were to give the treatment to those who normally wouldn't take it.

A third question you could ask is "For everyone, what would be the effect of treatment A vs. control?" This is an average treatment effect in the population (ATE) question, and is usually the question we want to answer in a randomized trial. It's very important to know which question you want to answer because each requires a different method. I'll carry on assuming you want the ATC for each treatment.
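
The three estimands can be seen side by side in a small simulation. This is only a sketch with invented data: in a simulation both potential outcomes (`y0`, `y1`) are known for every unit, which is never the case with real data, and all variable names are hypothetical.

```r
set.seed(42)
n <- 1000

# Hypothetical covariate that drives both treatment uptake and the size of the effect
x <- rnorm(n)

# Potential outcomes: y0 under control, y1 under treatment A
y0 <- x + rnorm(n)
y1 <- y0 + 1 + 0.5 * x          # the effect is larger for units with larger x

# Units with larger x are more likely to take the treatment
treated <- rbinom(n, 1, plogis(x)) == 1

effect <- y1 - y0
att <- mean(effect[treated])    # average effect among the treated
atc <- mean(effect[!treated])   # average effect among the controls
ate <- mean(effect)             # average effect for everyone

print(c(ATT = att, ATC = atc, ATE = ate))
```

Because the effect depends on `x` and `x` also drives who gets treated, the three numbers differ, and the ATE is the average of the ATT and ATC weighted by the share of treated and control units.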

To get the ATC using matching, you can just perform standard matching between the control and each treatment group. This requires that you keep the control group intact (i.e., no adjustment for common support or caliper). One treatment group at a time, you find the treated individuals that are similar to the control group. After doing this for each treatment group, you can use regression in the aggregate matched sample to estimate the effects of each treatment vs. control on the outcome. To make this straightforward, simply make the control group the reference category of the treatment factor in the regression.

Here's how you might do this in MatchIt:

library(MatchIt)
treatments <- levels(data$treat) #Levels of treatment variable
control <- "control" #Name of control level
data$match.weights <- 1 #Initialize matching weights

for (i in treatments[treatments != control]) {
  d <- data[data$treat %in% c(i, control),] #Subset just the control and 1 treatment
  d$treat_i <- as.numeric(d$treat != i) #New binary indicator: 1 = control (the focal group), 0 = treatment i
  m <- matchit(treat_i ~ cov1 + cov2 + cov3, data = d)
  data[names(m$weights), "match.weights"] <- m$weights[names(m$weights)] #Assign matching weights
}

#Check balance using cobalt
library(cobalt)
bal.tab(treat ~ cov1 + cov2 + cov3, data = data, 
        weights = "match.weights", method = "matching", 
        focal = control, which.treat = .all)

#Estimate treatment effects
summary(glm(outcome ~ relevel(treat, control), 
            data = data[data$match.weights > 0,], 
            weights = match.weights))

It's a lot easier to do this using weighting instead of matching. The same assumptions and interpretations of the estimands apply. Using WeightIt, you can simply run

library(WeightIt)
w.out <- weightit(treat ~ cov1 + cov2 + cov3, data = data, focal = "control", estimand = "ATT")

#Check balance
bal.tab(w.out, which.treat = .all)

#Estimate treatment effects (using jtools to get robust SEs)
#(Can also use survey package)
library(jtools)
summ(glm(outcome ~ relevel(treat, "control"), data = data,
         weights = w.out$weights), robust = "HC1")

To get the ATE, you need to use weighting. In the code above, simply replace estimand = "ATT" with estimand = "ATE" and remove focal = "control". Take a look at the WeightIt documentation for more options. In particular, you can set method = "gbm", which will give you the same results as using twang. Note that I'm the author of both cobalt and WeightIt.
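
To give a feel for what changing the estimand does to the weights, here is a base-R sketch of the standard inverse probability weighting formulas for the simpler case of one treatment versus control. It is not a reproduction of WeightIt's internals, the data are simulated, and every variable name is hypothetical.

```r
set.seed(7)
n <- 500
x <- rnorm(n)
treat <- rbinom(n, 1, plogis(0.8 * x))   # 1 = treatment A, 0 = control

# Propensity score: estimated probability of receiving treatment A
ps <- fitted(glm(treat ~ x, family = binomial))

# ATE weights: every unit is weighted up to represent the full sample
w_ate <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))

# ATC weights (control is the focal group): controls keep weight 1,
# treated units are reweighted to resemble the controls
w_atc <- ifelse(treat == 1, (1 - ps) / ps, 1)

# The weighted covariate gap should be much smaller than the raw gap
raw_gap <- mean(x[treat == 1]) - mean(x[treat == 0])
atc_gap <- weighted.mean(x[treat == 1], w_atc[treat == 1]) - mean(x[treat == 0])
print(c(raw = raw_gap, weighted = atc_gap))
```

With a focal group, only the non-focal units are reweighted; for the ATE, every unit receives a weight, which is why there is no focal argument in that case.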

Lopez, M. J., & Gutman, R. (2017). Estimation of Causal Effects with Multiple Treatments: A Review and New Ideas. Statistical Science, 32(3), 432–454. https://doi.org/10.1214/17-STS612
