Solved – Why is post-treatment bias a bias and not just multicollinearity?

Tags: bias, multicollinearity, observational-study, regression

In this presentation by Gary King, he discusses post treatment bias as follows:

Post treatment bias occurs:

  • when controlling away for the consequences of treatment
  • when causal ordering among predictors is ambiguous or wrong

Example of avoidable post-treatment bias: Causal effect of Race on
Salary in a firm

  • DO control for qualifications
  • DON'T control for position in the firm

Example of unavoidable post-treatment bias: Causal effect of
democratization on civil war, do we control for GDP?

  • Yes, since GDP -> democratization, we must control to avoid omitted variable bias
  • No, since democratization -> GDP, we would have post treatment bias

I don't understand how this is a bias and not simply a problem of multicollinearity. In the first example, if Black employees tend to get low-ranking positions, then yes, Race and Position are highly correlated, which leads to high standard errors. But why does it lead to bias?

Best Answer

First, let's clear up the difference between the two terms, and then discuss the respective problems each causes.

Multi-collinearity refers to a problematic relationship among multiple right-hand-side variables (usually control variables) caused by their being highly correlated, regardless of causal ordering. Post-treatment bias refers to a problematic relationship between your treatment variable and at least one control variable, based on a hypothesized causal ordering. Furthermore, multi-collinearity and post-treatment bias cause different problems if they are not avoided.

Multi-collinearity generally refers to a high correlation between multiple right-hand-side variables (usually two control variables) in a regression model, and that is where it is a problem. If a right-hand-side variable and your outcome variable were highly correlated (conditional on the other right-hand-side variables), however, that would not necessarily be a problem; instead, it would suggest a strong relationship that might be of interest to the researcher.

Multi-collinearity between control variables does not impair the reliability of the model overall: we can still reliably interpret the coefficient and standard error on our treatment variable. The downside of multi-collinearity is that we can no longer interpret the coefficients and standard errors on the highly correlated control variables themselves. But if we are strict in conceiving of our regression model as a notional experiment, where we want to estimate the effect of one treatment (T) on one outcome (Y) and treat the other variables (X) in the model as controls (not as estimable quantities of causal interest), then regressing on highly correlated control variables is fine.
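A quick simulation can make this concrete. In the hypothetical setup below (all variable names and coefficients are invented for illustration), two control variables X1 and X2 are nearly perfectly collinear, yet the treatment coefficient on T is still recovered accurately:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: treatment T and two nearly collinear controls X1, X2.
T = rng.normal(size=n)
X1 = rng.normal(size=n)
X2 = X1 + rng.normal(scale=0.01, size=n)  # X2 is almost a copy of X1
Y = 2.0 * T + 1.0 * X1 + 1.0 * X2 + rng.normal(size=n)

# OLS via least squares; columns = [intercept, T, X1, X2]
Z = np.column_stack([np.ones(n), T, X1, X2])
beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)

# The coefficient on T is recovered near its true value of 2.0, even though
# the individual coefficients on X1 and X2 are unstable (only their sum is
# well identified).
print(beta[1])
```

Only the collinear controls' individual coefficients become uninterpretable; the treatment estimate is unaffected.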

Another fact worth keeping in mind is that if two variables are perfectly multicollinear, then one must be dropped from any regression model that includes them both.

For more, see http://en.wikipedia.org/wiki/Multicollinearity

Post-treatment bias occurs when the regression model includes a consequence of treatment as a control variable, regardless of how highly correlated that control variable is with the treatment (although, generally, the severity of post-treatment bias increases with the correlation between the treatment and the consequence-of-treatment control variable).

Post-treatment bias is a problem because one of your control variables will mathematically “soak up” some of the effect of your treatment, thus biasing your estimate of the treatment effect. That is, some of the variation in your outcome due to your treatment will be accounted for in the coefficient estimate on the consequence-of-treatment control variable. This is misleading because to estimate the full effect of treatment, you want all of the variation explained by the treatment to be included in the treatment variable's coefficient estimate.

As an example, we want to study the impact of race on salary. Imagine that race affects job position, which in turn affects salary, and the full effect of race on salary is due to the way that race changes people’s job position. That is, other than how race affects job position, there is no effect of race on salary. If we regressed salary on race and controlled for job position, we would (correctly, mathematically speaking) find no relationship between race and salary, conditional on job position.
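This full-mediation scenario is easy to simulate. The sketch below (hypothetical coefficients, chosen only for illustration) generates data where race affects salary solely through job position, then fits both regressions with least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical DGP: race -> position -> salary, with NO direct
# race -> salary path (full mediation).
race = rng.binomial(1, 0.5, size=n).astype(float)
position = 1.0 * race + rng.normal(size=n)    # race affects position
salary = 2.0 * position + rng.normal(size=n)  # position affects salary

def ols(y, *cols):
    """OLS coefficients with an intercept, via least squares."""
    Z = np.column_stack([np.ones(len(y))] + list(cols))
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

total = ols(salary, race)[1]                  # ~2.0: total effect of race
conditional = ols(salary, race, position)[1]  # ~0.0: effect "controlled away"
print(total, conditional)
```

Controlling for position drives the race coefficient to zero, exactly the "no relationship, conditional on job position" result described above, even though the total effect of race is large.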

To highlight how controlling for a consequence of treatment biases your treatment estimate, consider the difference between a researcher interested in the total effect of a treatment versus the direct effect of a treatment. If we want to study the total impact of race on salary we do not care how that effect is mediated. We care about all pathways linking race and salary. We do not want to control for any variable that mediates the effect of race on salary. If we care about only the direct effect of race on salary (although this research question smacks of pre-Darwinian scientific racism), we want to exclude any "mediated" effects from our treatment estimate. So we would want to control for job position, education, social networks, etc. These change the treatment estimate. If our goal is to estimate the direct effect, then control for the consequences of treatment. If our goal is to estimate the total effect, however, controlling for these consequences of treatment biases our treatment estimate.
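The total-versus-direct distinction can also be sketched in a simulation. Here the DGP (invented for illustration) has both a direct path T -> Y and a mediated path T -> M -> Y, so the two regressions recover two different, well-defined quantities:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical DGP: direct path T -> Y (coef 1.0) plus a mediated path
# T -> M -> Y (1.0 * 2.0 = 2.0), so the total effect of T is 3.0.
T = rng.normal(size=n)
M = 1.0 * T + rng.normal(size=n)
Y = 1.0 * T + 2.0 * M + rng.normal(size=n)

def ols(y, *cols):
    """OLS coefficients with an intercept, via least squares."""
    Z = np.column_stack([np.ones(len(y))] + list(cols))
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

total_eff = ols(Y, T)[1]      # ~3.0: total effect (do NOT control for M)
direct_eff = ols(Y, T, M)[1]  # ~1.0: direct effect (control for M)
print(total_eff, direct_eff)
```

Neither regression is wrong in itself; the "bias" arises when you control for the mediator M while claiming to estimate the total effect.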

For more intuition through example, refer to Gelman and Hill (2007) "Data Analysis Using Regression and Multilevel/Hierarchical Models," pages 188-192.