I'll try to break it down a bit. I think most of the confusion when studying potential outcomes (i.e., $Y_0, Y_1$) comes from not realizing that $Y_0, Y_1$ are different objects from $Y$, even before bringing in the covariate $X$. The key is that every individual $i$ has potential outcomes $(Y_{i1},Y_{i0})$, but you only observe one of them, $Y_{iT_i}$, in the data.
Ignorability says
$$(Y_0,Y_1) \perp \!\!\! \perp T|X$$
which says that, conditional on $X$, the potential outcomes are independent of treatment $T$. It is not saying that $Y$ is independent of $T$. As you point out, that would make no sense. In fact, a classic way to rewrite $Y$ is as
$$Y = Y_1T + Y_0(1-T)$$
which tells us that for every individual we observe $Y_i$, which is either $Y_{i1}$ or $Y_{i0}$ depending on the value of treatment $T_i$. The reason for potential outcomes is that we want to know the effect $Y_{i1} - Y_{i0}$ but only observe one of the two objects for each person. The question is: what would $Y_{i0}$ have been for the individuals $i$ who have $T_i=1$ (and vice versa)? Ignoring the conditional-on-$X$ part, the ignorability assumption essentially says that treatment $T$ can certainly affect $Y$, by virtue of $Y$ being equal to $Y_1$ or $Y_0$, but that $T$ is unrelated to the values of $Y_0,Y_1$ themselves.
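To see this notation in action, here is a tiny simulated sketch (all numbers purely illustrative): every unit carries both potential outcomes, but the data reveal only the one selected by $T_i$.

```python
# A minimal sketch with made-up numbers: each unit has both Y0 and Y1,
# but we only get to see the one picked out by its treatment status T.
import numpy as np

rng = np.random.default_rng(0)
n = 5
Y0 = rng.normal(10, 1, n)    # outcome each unit would have without treatment
Y1 = Y0 + 2                  # outcome with treatment (a constant effect of +2)
T = rng.integers(0, 2, n)    # realized treatment status

Y = Y1 * T + Y0 * (1 - T)    # the observed outcome: Y = Y1*T + Y0*(1-T)
print(np.column_stack([Y0, Y1, T, Y]))
# Each row shows Y0, Y1, T, and the single realized Y; in real data the
# counterfactual column would be missing.
```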
To motivate this, consider a simple example where we have only two types of people: weak people and strong people. Let treatment $T$ be receiving medication, and $Y$ is health of patient (higher $Y$ means healthier). Strong people are far healthier than weak people. Now suppose that receiving medication makes everyone healthier by a fixed amount.
First case: suppose that only unhealthy people seek out medication. Then those with $T=1$ will be mostly the weak people, since they are the unhealthy ones, and those with $T=0$ will be mostly strong people. But then ignorability fails, since the values of $(Y_1,Y_0)$ are related to treatment status $T$: in this case, both $Y_1$ and $Y_0$ will be lower for $T=1$ than for $T=0$, since the $T=1$ group is filled with mostly weak people and we stated that weak people are simply less healthy overall.
Second case: suppose that we randomly assign medication to our pool of strong and weak people. Here, ignorability holds, since $(Y_1,Y_0)$ are independent of treatment status $T$: weak and strong people are equally likely to receive treatment, so the values of $Y_1$ and $Y_0$ are on average the same for $T=0$ and $T=1$. However, since $T$ makes everyone healthier, clearly $Y$ is not independent of $T$: treatment has a fixed effect on health in my example!
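A quick simulation of both cases may help (the numbers are made up: strong people have higher baseline health, and the medication adds a fixed +2 to everyone). Under selection into treatment, the naive difference in means is badly biased; under randomization it recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
strong = rng.integers(0, 2, n)    # 1 = strong, 0 = weak
Y0 = 5 + 3 * strong               # untreated health: strong people are healthier
Y1 = Y0 + 2                       # the medication adds a fixed +2 for everyone

# Case 1: mostly the unhealthy (weak) seek out medication -> ignorability fails
T_sick = rng.binomial(1, np.where(strong == 1, 0.2, 0.8))
Y = Y1 * T_sick + Y0 * (1 - T_sick)
print(Y[T_sick == 1].mean() - Y[T_sick == 0].mean())    # about 0.2, far from 2

# Case 2: random assignment -> ignorability holds
T_rand = rng.integers(0, 2, n)
Y = Y1 * T_rand + Y0 * (1 - T_rand)
print(Y[T_rand == 1].mean() - Y[T_rand == 0].mean())    # about 2, the true effect
```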
In other words, ignorability allows that $T$ directly affects whether you receive $Y_1$ or $Y_0$, but treatment status is not related to these values themselves. In this case, we can figure out what $Y_0$ would have been for those who received treatment by looking at the outcomes of those who didn't receive treatment! We get a treatment effect by comparing those who receive treatment to those who don't, but we need a way to make sure that those who receive treatment are not fundamentally different from those who don't, and that is precisely what the ignorability condition assumes.
We can illustrate with two other examples:
A classic case where this holds is randomized controlled trials (RCTs), where you randomly assign treatment to individuals. Those who get treatment may well have a different outcome because treatment affects the outcome (unless treatment truly has no effect), but those who get treatment are randomly selected, so treatment receipt is independent of the potential outcomes, and indeed $(Y_0,Y_1) \perp \!\!\! \perp T$. The ignorability assumption holds.
For an example where this fails, let treatment $T$ be an indicator for finishing high school, let the outcome $Y$ be income in 10 years, and define $(Y_0,Y_1)$ as before. Then $(Y_0,Y_1)$ is not independent of $T$, since presumably the potential outcomes for those with $T=0$ are fundamentally different from those with $T=1$. Maybe people who finish high school have more perseverance than those who don't, or come from wealthier families, and these in turn imply that if we could observe a world where the individuals who finished high school had not finished it, their outcomes would still differ from those of the observed pool of individuals who did not finish high school. As such, the ignorability assumption likely does not hold: treatment is related to the potential outcomes, and in this case we might expect that $E[Y_0 \mid T = 1] > E[Y_0 \mid T = 0]$.
The conditioning on $X$ part is simply for cases where ignorability holds only conditional on some controls. In your example, it may be that treatment is independent of the potential outcomes only after conditioning on patient history. For an example where this may happen, suppose that individuals with a more severe patient history $X$ are both sicker and more likely to receive treatment $T$. Then without $X$, we run into the same problem as described above: the unrealized $Y_0$ for those who receive treatment may be lower than the realized $Y_0$ for those who do not, because the former are more likely to be unhealthy individuals, so comparing those with and without treatment causes issues because we are not comparing the same kinds of people. However, if we control for patient history, we can instead assume that, conditional on $X$, treatment assignment is again unrelated to the potential outcomes, and we are good to go again.
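Here is a minimal sketch of that story (hypothetical numbers): patient history $X$ drives both sickness and treatment, so the naive comparison is biased, but comparing within levels of $X$ recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
X = rng.integers(0, 2, n)    # 1 = more severe patient history
Y0 = 8 - 3 * X               # sicker baseline when X = 1
Y1 = Y0 + 2                  # true treatment effect: +2
T = rng.binomial(1, np.where(X == 1, 0.8, 0.2))    # sicker patients treated more often
Y = Y1 * T + Y0 * (1 - T)

# The naive comparison is biased ...
print(Y[T == 1].mean() - Y[T == 0].mean())    # about 0.2, far from 2
# ... but within levels of X, treated and untreated are comparable again:
stratum_effects = [Y[(T == 1) & (X == x)].mean() - Y[(T == 0) & (X == x)].mean()
                   for x in (0, 1)]
print(np.mean(stratum_effects))               # about 2 after conditioning on X
```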
Edit
As a final note, based on a chat with OP, it may be helpful to relate the potential outcomes framework to the DAG in OP's post (Noah's response covers a similar setting with more formality, so it is definitely also worth checking out). In these types of DAGs, we fully model the relationships between variables. Forgetting about $X$ for a bit, suppose we just have $T \rightarrow Y$. What does this mean? It means that the only effect of $T$ on $Y$ runs through the value of $T$ being $1$ or $0$, and through no other channels, so we immediately have that $T$ affects $Y = Y_1T+ Y_0(1-T)$ only through the value of $T$. You may think, "well, what if $T$ affects $Y$ through some other channel?" but by writing $T \rightarrow Y$, we are saying there are no other channels.
Next, consider your case of $X \rightarrow T \rightarrow Y \leftarrow X$. Here, $T$ directly affects $Y$, but $X$ also directly affects both $T$ and $Y$.
Why does ignorability fail? Because $T$ can equal $1$ partly through the effect of $X$, which also affects $Y$, so in the group where $T=1$ the values of $Y_0$ and $Y_1$ differ systematically from those in the group where $T=0$. That is, $T$ affects $Y_1T + Y_0(1-T)$ through two channels: (1) the direct effect of the value of $T$, and (2) the fact that $X$ affects $Y$ and $T$ at the same time, which makes treatment status informative about $Y_1$ and $Y_0$.
As I see it, there are two related reasons to consider matching instead of regression. The first is assumptions about functional form, and the second is about proving to your audience that functional form assumptions do not affect the resulting effect estimate. The first is a statistical matter and the second is epistemic. Consider the tale below that attempts to illustrate how the choice between matching and regression could play out.
We'll assume you have measured a sufficient adjustment set to satisfy the backdoor criterion (i.e., all relevant confounders have been measured) with no measurement error or missing data, and that your goal is to estimate the marginal treatment effect of the treatment on an outcome. We'll also assume the standard assumptions of positivity and SUTVA hold. We'll consider a continuous outcome first, but much of the discussion extends to general outcomes.
Part 1: Regression
You decide to run a regression of the outcome on the treatment and confounders as a way to control for confounding by these variables, because that is what linear regression is supposed to do. However, the effect estimate is only unbiased under extremely strict circumstances: first, that the treatment effect is constant across levels of the confounders, and second, that the linear model correctly describes the conditional relationship between the outcome and the confounders. For the first, you might include an interaction between the treatment and each confounder, allowing for heterogeneous treatment effects while still estimating the marginal effect. This is equivalent to g-computation (1), which involves using the fitted regression model to generate predicted values under treatment and control for all units and using the difference in the means of these predicted values as the effect estimate.
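To make the g-computation recipe concrete, here is a minimal sketch on simulated data (the data-generating process and all numbers are invented for illustration): fit an outcome regression with treatment-covariate interactions, predict for every unit under treatment and under control, and average the difference.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 50_000
X = rng.normal(size=(n, 2))                         # two confounders
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))     # treatment depends on X
Y = 2 * T + X @ np.array([1.0, -0.5]) + T * X[:, 0] + rng.normal(size=n)
# The effect varies with X[:, 0]; the true marginal ATE is 2.

# Outcome model with treatment, covariates, and treatment-by-covariate interactions
D = np.column_stack([T, X, T[:, None] * X])
model = LinearRegression().fit(D, Y)

# Predict for every unit as if treated, then as if untreated, and average
D1 = np.column_stack([np.ones(n), X, X])            # T = 1, so interactions equal X
D0 = np.column_stack([np.zeros(n), X, 0 * X])       # T = 0, so interactions are zero
ate = model.predict(D1).mean() - model.predict(D0).mean()
print(ate)                                          # close to 2
```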
That still assumes a linear model for the outcomes under treatment and control. Okay, we'll use a flexible machine-learning method like random forests instead. Well, now we can't claim our estimator is unbiased, only possibly consistent, and it still requires the specific machine learning model to approach the truth at a certain rate. Okay, we'll use Superlearner (2), a stacking method that takes on the rate of convergence of the fastest of its included models. Well, now we don't have a way to conduct inference, and the model might still be wrong. Okay, we'll use a semiparametric efficient doubly-robust estimator like augmented inverse probability weighting (AIPW) (3) or targeted minimum loss-based estimation (TMLE) (4). Well, the asymptotic guarantees of those estimators hold only if the nuisance models fall in the Donsker class. Okay, we'll use cross-fitting with AIPW or TMLE to relax that requirement (5).
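And here is a hedged sketch of the AIPW step in that chain, reusing the simulated data and fitted outcome model from the block above: the estimator augments the outcome-model predictions with inverse-probability-weighted residuals, which is what buys the double robustness.

```python
from sklearn.linear_model import LogisticRegression

e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]    # propensity score
mu1, mu0 = model.predict(D1), model.predict(D0)                  # outcome predictions

# AIPW: consistent if either the outcome model or the propensity model is correct
aipw = np.mean(mu1 - mu0
               + T * (Y - mu1) / e_hat
               - (1 - T) * (Y - mu0) / (1 - e_hat))
print(aipw)    # close to 2
```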
Great. You've taken regression to its extreme, relaxing as many assumptions as possible and landing on a multiply-robust estimator (multiply-robust in the sense that if any one of several models is correct, the estimator is consistent) with generally good inference properties (but it can be bootstrapped, so getting the variance exactly right isn't a big problem). Have we solved causal inference?
You submit the results of your cross-fit TMLE estimate using Superlearner for the propensity score and potential outcome models with a full library including the highly adaptive lasso and many other models, which, under weak assumptions, is all that is required for a truly consistent estimator that converges at a parametric rate.
A reviewer reads the paper and says, "I don't believe the results of this model."
"Why not?" you say. "I used the optimal estimator with the best properties; it is consistent and semiparametric efficient with few, if any, assumptions on the functional forms of the models."
"Your estimator is consistent," says the reviewer, "but not unbiased. That means I can only trust its results in general and as N goes to infinity. How do I know you have successfully eliminated bias in the effect estimate in this dataset?"
"..."
Part 2: Matching to the Rescue
You read about a hot new method called "propensity score matching" (6). It was big in 1983, and, even in 2021, you see it in almost every paper published in specialized medical journals. You come across King and Nielsen's influential paper "Why Propensity Scores Should Not Be Used for Matching" (7) and Noah's answer on CV describing the many drawbacks to using propensity score matching. Okay, you'll use genetic matching instead (8), and minimize the energy distance between the samples (9), including a flexibly estimated propensity score as a covariate to match on. You find that balance can be improved by using substantive knowledge to incorporate exact matching and caliper constraints that prioritize balance on covariates known to be important to the outcome. You decide to use full matching to relax the requirement of 1:1 matching to include more units in the analysis (10).
You estimate the treatment effect using a simple linear regression of the outcome on the treatment and the covariates, including the matching weights in the regression and using a cluster-robust standard error to account for pair membership (11). You resubmit the result of your full matching analysis using exact matching and calipers for prognostically important variables and a distance matrix estimated using genetic matching on the covariates and a flexibly estimated propensity score.
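The analysis just described would be run with packages like MatchIt in R; as a deliberately simplified stand-in (1:1 nearest-neighbor matching on a logistic propensity score with made-up data, not the full or genetic matching used above), the following sketch shows the basic workflow of matching and then checking balance with standardized mean differences:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 2_000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1]))))
Y = 2 * T + X.sum(axis=1) + rng.normal(size=n)    # true effect = 2

ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
# Greedy 1:1 nearest-neighbor match on the propensity score, with replacement
matches = control[np.abs(ps[treated][:, None] - ps[control][None, :]).argmin(axis=1)]

def smd(a, b):
    """Standardized mean difference for one covariate."""
    return (a.mean() - b.mean()) / np.sqrt((a.var() + b.var()) / 2)

for j in range(X.shape[1]):
    print(f"covariate {j}: SMD before = {smd(X[treated, j], X[control, j]):.2f}, "
          f"after = {smd(X[treated, j], X[matches, j]):.2f}")
print("matched ATT estimate:", (Y[treated] - Y[matches]).mean())    # roughly 2
```

In practice you would report balance far more broadly than a single SMD per covariate, which is exactly where the dialogue below goes.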
The reviewer reads your new manuscript. "Wow, you've learned a lot. But I still don't believe you've removed bias in the effect estimate."
"Look at the balance tables," you say. "The covariate distributions are almost identical."
"I see low standardized mean differences," says the reviewer, "but imbalances could remain on other features of the covariate distribution."
"Look at the balance tables in the appendix which contain balance statistics for pairwise interactions, polynomials up to the 5th power of each covariate, and Kolmogorov-Smirnov statistics to compare the full covariate distributions. There are no meaningful differences between the samples, and no differences at all on the most highly prognostic covariates because of the exact matching constraints and calipers."
"I see..."
"Also, I used Branson's randomization test (12) with the energy distance as the balance statistic to show that my sample is better balanced not only than a hypothetical randomized trial using the same data, but also a block randomized trial, and even a covariate balance-constrained randomized trial."
"Wow, I guess I don't have much to say..."
"My outcome regression estimator isn't just consistent, it's truly unbiased in this sample. Also, because I incorporated pair membership into the analysis, my standard errors are smaller and more accurate and the resulting estimate is less sensitive to unobserved confounding* (13)."
"I get it!"
Part 3: The Criticism
Frank Harrell bursts into the room. "Wait, by discarding so many units in matching, you have thrown away so much useful data and needlessly decimated your precision." Mark van der Laan follows. "Wait, by using substantive 'expertise' you are not letting the analysis method find the true patterns in the data that might have eluded researchers, and your estimator does not converge at a known rate, let alone a parametric one! And there is no guarantee that your inference is valid!" I, your humble narrator, too, join in on the dogpile. "Wait, by using exact matching constraints and calipers, you have shifted your estimand away from the ATE or any a priori describable estimand (14)! Your effect estimate may be unbiased, but unbiased for what?"
You stand there, bewildered, defeated, feeling like you have come nowhere since you asked your simple question on CrossValidated what felt like years ago, no closer to understanding whether you should use matching or regression to estimate causal effects.
The curtains close.
Part 4: Epilogue
In the face of uncertainty and scarcity, we are left with tradeoffs. The choice between a regression-based method and matching to estimate a causal effect depends on how you and your audience choose to manage those tradeoffs and prioritize the advantages and drawbacks of each method.
Standard regression requires strong functional form assumptions, but with advanced methods those can be relaxed, at the cost of giving up on unbiasedness and focusing on consistency and asymptotic inference. Many of these advanced methods work best in large samples, and they still require many choices along the way (e.g., which specific estimator to use, which machine learning methods to include in the Superlearner library, how many folds to use for cross-validation and cross-fitting, etc.). Although the multiply-robust methods may guarantee consistency and fast convergence rates in general, it is not immediately clear how to assess how well they eliminated bias in your dataset, potentially leaving one skeptical of their actual performance in your one instance.
Matching methods require few functional form assumptions because no models are required (e.g., when using a distance matrix that doesn't depend solely on the propensity score, like that resulting from genetic matching). You can control confounding by adjusting the specification of the match, focusing more effort on hard-to-balance or prognostically important variables. You can come close to guaranteeing unbiasedness by ensuring you have achieved covariate balance, which can and should be measured extremely broadly with a skeptic in mind. You can use tools for analyzing randomized trials and trials with more powerful and robust designs. This comes at the cost of possibly decimating your precision by discarding huge amounts of data, changing your estimand so that your effect estimate doesn't generalize to a meaningful population and isn't replicable, and relying on ad hoc, "artisanal" methods with no clear path for valid inference.
The advantage matching has over regression, and the reason I think it is so valuable and why I devoted my graduate training to understanding and improving matching and its use by applied researchers as the author of the R packages cobalt, WeightIt, MatchIt, and others, is an epistemic advantage. With matching, you can more effectively convince a reader that what you have done is trustworthy and that you have accounted for all possible objections to the observed result, and you can at least point to specific assumptions and explain how their violation might affect results. This all centers on covariate balance, the similarity between covariate distributions across the treatment groups. By reporting balance broadly and submitting the resulting matched data to a battery of tests and balance measures, you can convince yourself and your readers that the resulting effect estimate is unbiased and therefore trustworthy (given the assumptions mentioned at the beginning, though these may be tenuous, and neither matching nor regression can solve that problem).
However, not everyone agrees that this advantage is so important, or more important than consistency and valid asymptotic inference. There can never be consensus on this matter, because consensus requires knowing the truth, and science (including statistics research) is about searching for an inherently unknowable truth (i.e., the true parameters that govern or describe our world). That is, if we knew the true causal effect, we could know the best method to estimate it, but we don't, so we can't. We can only do our best using the knowledge we have and try to manage the inherent constraints and tradeoffs as well as we can as we fumble around in the dark using the pinpoint of light the universe has shown us.
*Only when using a special method of inference for matched samples.
1. Snowden JM, Rose S, Mortimer KM. Implementation of G-Computation on a Simulated Data Set: Demonstration of a Causal Inference Technique. Am J Epidemiol. 2011;173(7):731–738.
2. van der Laan MJ, Polley EC, Hubbard AE. Super Learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(1). (https://www.degruyter.com/view/j/sagmb.2007.6.issue-1/sagmb.2007.6.1.1309/sagmb.2007.6.1.1309.xml). (Accessed October 8, 2019)
3. Daniel RM. Double Robustness. In: Wiley StatsRef: Statistics Reference Online. Wiley; 2018:1–14. (http://onlinelibrary.wiley.com/doi/abs/10.1002/9781118445112.stat08068). (Accessed November 9, 2018)
4. Gruber S, van der Laan MJ. Targeted Maximum Likelihood Estimation: A Gentle Introduction. 2009.
5. Zivich PN, Breskin A. Machine Learning for Causal Inference: On the Use of Cross-fit Estimators. Epidemiology. 2021;32(3):393–401.
6. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55.
7. King G, Nielsen R. Why Propensity Scores Should Not Be Used for Matching. Polit. Anal. 2019;1–20.
8. Diamond A, Sekhon JS. Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics. 2013;95(3):932–945.
9. Huling JD, Mak S. Energy Balancing of Covariate Distributions. arXiv:2004.13962 [stat]. 2020. (http://arxiv.org/abs/2004.13962). (Accessed December 22, 2020)
10. Stuart EA, Green KM. Using full matching to estimate causal effects in nonexperimental studies: Examining the relationship between adolescent marijuana use and adult outcomes. Developmental Psychology. 2008;44(2):395–406.
11. Abadie A, Spiess J. Robust Post-Matching Inference. Journal of the American Statistical Association. 2020;1–37.
12. Branson Z. Randomization Tests to Assess Covariate Balance When Designing and Analyzing Matched Datasets. Observational Studies. 2021;7:44–80.
13. Zubizarreta JR, Paredes RD, Rosenbaum PR. Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in Chile. The Annals of Applied Statistics. 2014;8(1):204–231.
14. Greifer N, Stuart EA. Choosing the Estimand When Matching or Weighting in Observational Studies. arXiv:2106.10577 [stat]. 2021. (http://arxiv.org/abs/2106.10577). (Accessed September 17, 2021)
Best Answer
The assumption of strong ignorability is that there exists a set of variables $W$, possibly a subset of all measured variables $V$, such that $Y^X \perp X \mid W$. It does not say that $Y^X \perp X \mid V$, i.e., that the potential outcomes are independent of treatment given all measured variables $V$. To meet this assumption, one has to find the set $W$. In your example, $W$ consists only of $L$ and does not include $Z$. So, strong ignorability is not met when using $W = \{L, Z\}$ but it is met when using $W = \{L\}$. We call $W$ a "sufficient adjustment set". Note that strong ignorability says nothing about how to construct or find $W$ or which sets of variables are allowable to satisfy the assumption. It is merely the assumption that there is a set of such variables. Estimators that rely on strong ignorability (and not all estimators of causal effects do) then use $W$ to estimate the effect.
The theory used to identify which variables form a sufficient adjustment set, i.e., which set of variables to include in $W$ to meet strong ignorability, is DAG theory. DAG theory says not to include $Z$ in $W$ precisely because doing so opens up a backdoor path; conditioning on a collider induces a non-causal association between the antecedents of the collider.
In an ideal observational study, a researcher should identify a set of variables to collect that form a sufficient adjustment set. In practice, researchers often have a dataset with many variables already measured, and they have to decide which ones among them belong in $W$, which do not belong in $W$ (e.g., colliders), and whether there is a set of variables that forms a sufficient adjustment set. My sense is that most studies do this to some extent, which is why you might not have to worry about studies being invalidated by conditioning on colliders. One way to avoid conditioning on colliders is to only condition on variables measured before treatment (i.e., "pre-treatment covariates"); these cannot be caused by either the treatment or the outcome and so are less likely to induce a violation of strong ignorability (though pre-treatment colliders of unmeasured pre-treatment causes of the treatment and outcome can still be a problem; this is called "M-bias").
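As a hedged illustration of that last point, here is a minimal simulation of the M-bias structure just described (the structural equations are invented): $T \leftarrow U_1 \rightarrow Z \leftarrow U_2 \rightarrow Y$, with a true effect of $T$ on $Y$ of 2 and no confounding, so no adjustment is needed, and adjusting for the pre-treatment collider $Z$ introduces bias.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 200_000
U1, U2 = rng.normal(size=n), rng.normal(size=n)    # unmeasured causes
T = rng.binomial(1, 1 / (1 + np.exp(-U1)))         # U1 -> T
Z = U1 + U2 + rng.normal(size=n)                   # pre-treatment collider: U1 -> Z <- U2
Y = 2 * T + U2 + rng.normal(size=n)                # T -> Y and U2 -> Y; no confounding

# Unadjusted estimate is fine: U1 and U2 are independent, so no backdoor path is open
print(LinearRegression().fit(T[:, None], Y).coef_[0])               # about 2
# Conditioning on Z opens T <- U1 -> Z <- U2 -> Y and biases the estimate
print(LinearRegression().fit(np.column_stack([T, Z]), Y).coef_[0])  # pulled away from 2
```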