Machine Learning – How Causal Trees Optimize for Heterogeneous Treatment Effects

causality, machine-learning, mathematical-statistics

I have a very specific question regarding how the causal tree in the causal forest/generalized random forest optimizes for heterogeneity in treatment effects.

This question comes from the Athey & Imbens (2016) paper "Recursive partitioning for heterogeneous causal effects" from PNAS. Another paper is Wager & Athey (2018), "Estimation and inference of heterogeneous treatment effects using random forests" in JASA (arxiv.org link here). I know that the answer to my question is in those papers, but I, unfortunately, can't parse some of the equations to extract it. I know I understand an algorithm well when I can express it in words, so it has been irking me that I can't do so here.

In my understanding, an honest causal tree is generally constructed by:

Given a dataset with an outcome $Y$, covariates $X$, and a randomized condition $W$ that takes on the value of 0 for control and 1 for treatment:

  1. Split the data into subsample $I$ and subsample $J$

  2. Train a decision tree on subsample $I$ predicting $Y$ from $X$, with the requirement that each terminal node has at least $k$ observations from each condition in subsample $J$

  3. Apply the decision tree constructed on subsample $I$ to subsample $J$

  4. At each terminal node, take the mean observed outcome $Y$ for the $W = 1$ cases from subsample $J$ and subtract the mean observed outcome for the $W = 0$ cases from subsample $J$; the resulting difference is the estimated treatment effect

Any future, out-of-sample cases (such as those used after deploying the model) will be dropped down the tree and assigned the predicted treatment effect for the node in which they are placed.
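
To make my understanding concrete, here is a minimal sketch of steps 1–4 in Python (my own illustration: scikit-learn's `DecisionTreeRegressor` stands in for the tree grown in step 2, whereas the real causal tree uses the modified splitting criterion discussed below, and names like `honest_tree_effects` and `min_leaf` are made up for this example):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def honest_tree_effects(X, y, w, min_leaf=25, random_state=0):
    """Honest estimation sketch: grow a tree on subsample I,
    estimate leaf-wise treatment effects on subsample J.
    X, y, w are NumPy arrays; w is 0 (control) or 1 (treatment)."""
    # 1. Split the data into subsamples I and J
    X_I, X_J, y_I, y_J, w_I, w_J = train_test_split(
        X, y, w, test_size=0.5, random_state=random_state)

    # 2. Grow a tree on subsample I predicting Y from X
    #    (plain CART here; the paper's tree uses a modified criterion)
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf,
                                 random_state=random_state).fit(X_I, y_I)

    # 3. Apply the tree to subsample J: find each J observation's leaf
    leaves_J = tree.apply(X_J)

    # 4. In each leaf, the estimated effect is
    #    mean(Y | W = 1) - mean(Y | W = 0), computed only on subsample J
    effects = {}
    for leaf in np.unique(leaves_J):
        in_leaf = leaves_J == leaf
        treated = y_J[in_leaf & (w_J == 1)]
        control = y_J[in_leaf & (w_J == 0)]
        # A real honest tree guarantees k cases per condition in every leaf
        if len(treated) > 0 and len(control) > 0:
            effects[leaf] = treated.mean() - control.mean()
        else:
            effects[leaf] = np.nan
    return tree, effects

# Out-of-sample cases are dropped down the tree and get their leaf's estimate:
# tree, effects = honest_tree_effects(X, y, w)
# tau_new = np.array([effects[leaf] for leaf in tree.apply(X_new)])
```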

This is called "honest," because the actual training and estimation are done on completely different data. Athey and colleagues have a nice asymptotic theory showing that you can derive variance estimates for these treatment effects, which is part of the motivation behind making them "honest."

This is then applied to a causal random forest by using bagging or bootstrapping.


Now, Athey & Imbens (2016) note that this procedure uses a modified mean squared error criterion for splitting, which rewards "a partition for finding strong heterogeneity in treatment effects and penalize a partition that creates variance in leaf estimates" (p. 7357).

My question is: Can you explain how this is the case, using words?

In the previous two sections before this quotation, Modifying Conventional CART for Treatment Effects and Modifying the Honest Approach, the authors use the Rubin causal model/potential outcomes framework to derive an estimation for the treatment effect.

They note that we are not trying to predict $Y$—like in most machine learning cases—but the difference between the expectation of $Y$ in two conditions, given some covariates $X$. In line with the potential outcomes framework, this is "infeasible": We can only measure the outcome of someone in one of the two conditions.

In a series of equations, they show how we can use a modified splitting criterion that predicts the treatment effect. They say: "…the treatment effect analog is infeasible, but we can use an unbiased estimate of it, which leads to $-\hat{MSE}_{\tau}(S^{tr, cv}, S^{tr, tr}, \Pi)$" (p. 7357). As someone who has a background in social science and applied statistics, I can't connect the dots between what they have set up and how we can estimate it from the data. How does someone calculate $-\hat{MSE}_{\tau}(S^{tr, cv}, S^{tr, tr}, \Pi)$ from observed data? What is the equation for it?

Any help in explaining how this criterion maximizes the variance in treatment effects (i.e., the heterogeneity of causal effects) OR any correction on my description of how to build a causal tree that might be leading to my confusion would be greatly appreciated. Right now, I don't see how this approach differs from other algorithms that just train on $Y$ and estimate CATEs with $E(Y \mid W = 1, X) - E(Y \mid W = 0, X)$.

Best Answer

Your understanding is correct: the core notion of the paper is that sample-splitting is essential for empirical work, and that it gives us an unbiased estimate of the treatment effect.

To tackle your main question: the criteria of choice are $\hat{EMSE}_\tau$ and $\hat{EMSE}_\mu$. Both penalise variance as well as encourage heterogeneity. For starters, I will focus on the estimated expected MSE of the treatment effect, $\hat{EMSE}_\tau$. For a given tree/partition $\Pi$, using a training sample $\mathcal{S}^{tr}$ and an estimation sample of size $N^{est}$, the estimator of the otherwise "infeasible" criterion $-\hat{EMSE}_\tau(\mathcal{S}^{tr}, N^{est}, \Pi)$ is by definition the average squared estimated treatment effect across leaves (the term $\frac{1}{N^{tr}} \sum_{i \in \mathcal{S}^{tr}} \hat{\tau}^2 (X_i; \mathcal{S}^{tr}, \Pi)$, which is what rewards spreading the leaf-level effects apart) minus the uncertainty about these treatment effects (the within-leaf variance terms $S^2_{\mathcal{S}^{tr}_{treat}}$ and $S^2_{\mathcal{S}^{tr}_{control}}$, weighted by $\frac{1}{N^{tr}} + \frac{1}{N^{est}}$, so the penalty shrinks as the samples grow). Therefore the goodness-of-fit measure is not a "vanilla" MSE but rather a variance-penalised one: the stronger the heterogeneity in our estimates, the larger $-\hat{EMSE}_\tau$ becomes, and the higher the variance of our estimates, the smaller it becomes. Note also that the estimated average causal effect $\hat{\tau}(x; \mathcal{S}, \Pi)$ equals $\hat{\mu}(1,x; \mathcal{S}, \Pi) - \hat{\mu}(0,x; \mathcal{S}, \Pi)$, i.e. we reward the heterogeneity indirectly during the estimation of $\hat{\mu}$ too.
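
Since you asked for the equation itself: as far as I can reconstruct it from the paper's expected-MSE section (so please double-check against the original; $p$ is the marginal treatment probability and the $S^2(\ell)$ terms are within-leaf sample variances of the outcome among treated and control units), the estimator is

$$-\widehat{EMSE}_{\tau}\left(\mathcal{S}^{tr}, N^{est}, \Pi\right) \;=\; \frac{1}{N^{tr}} \sum_{i \in \mathcal{S}^{tr}} \hat{\tau}^{2}\left(X_i; \mathcal{S}^{tr}, \Pi\right) \;-\; \left(\frac{1}{N^{tr}} + \frac{1}{N^{est}}\right) \sum_{\ell \in \Pi} \left( \frac{S^{2}_{\mathcal{S}^{tr}_{treat}}(\ell)}{p} + \frac{S^{2}_{\mathcal{S}^{tr}_{control}}(\ell)}{1-p} \right)$$

The first sum is what rewards heterogeneity across leaves; the second is the variance penalty, which grows when leaves are small or noisy.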

More generally, the basic idea with sample-splitting is that we get the estimates for a tree from a sample separate from the one used to construct the tree (i.e. the partition $\Pi$ of the covariate space), and thus we can focus mostly on the variance rather than on the bias-variance trade-off. This is the gist of the Honest Splitting section, where we can see that the criteria of choice penalise small leaves precisely because small leaves are associated with high variance $S^2$ of the estimated effects.
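
As a rough numerical illustration of why small or noisy leaves get penalised, here is a sketch that evaluates the criterion above for one candidate partition (the function name `neg_emse_tau`, the constant treated share `p_treat`, and the assumption that every leaf contains at least two cases per condition are my own simplifications, not the authors' code):

```python
import numpy as np

def neg_emse_tau(y_tr, w_tr, leaf_ids, n_est, p_treat=0.5):
    """Variance-penalised splitting criterion -EMSE_tau for one partition.

    y_tr, w_tr : outcomes and 0/1 treatment indicators (training sample)
    leaf_ids   : leaf membership of each training observation under the partition
    n_est      : size of the estimation sample that will be used later
    p_treat    : share of treated observations (assumed constant across leaves)
    """
    n_tr = len(y_tr)
    reward, penalty = 0.0, 0.0
    for leaf in np.unique(leaf_ids):
        m = leaf_ids == leaf
        y1, y0 = y_tr[m & (w_tr == 1)], y_tr[m & (w_tr == 0)]
        tau_hat = y1.mean() - y0.mean()          # leaf-level effect estimate
        # Reward: average squared effect across observations,
        # larger when leaf effects are spread apart (heterogeneity)
        reward += m.sum() * tau_hat**2 / n_tr
        # Penalty: within-leaf outcome variances; small/noisy leaves blow this up
        penalty += y1.var(ddof=1) / p_treat + y0.var(ddof=1) / (1 - p_treat)
    return reward - (1.0 / n_tr + 1.0 / n_est) * penalty
```

During tree growing, every candidate split is scored this way and the split that increases the criterion the most is taken, which is exactly where the "reward heterogeneity, penalise variance" behaviour comes from.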

In conclusion, the task of making a RF consistent is attacked from two sides:

  1. The sample is split into training and estimation sets.
  2. The criterion for splitting is such that tree leaves are "big".

As mentioned throughout the paper, this induces a hit in terms of the MSE of the treatment effects, but in exchange the nominal coverage of their confidence intervals improves. I think Prof. Athey's quote from her 2016 presentation on Solving Heterogeneous Estimating Equations Using Forest Based Algorithms (21:25 to 22:02) captures the essence of this work nicely: "... people have said, if you're going to do hypothesis testing on treatment effects within leaves, shouldn't your objective function somehow anticipate you wanted to construct a confidence interval. (...) So we basically, instead of doing nearest neighbors like this" (using an adaptive $k$-NN estimator), "we're going to have tree based neighborhoods that basically slice up the covariate space according to where we see heterogeneity in the tree building sample. And then in the estimation sample, we'll come back and estimate treatment effects in that partition."
