Solved – Explain in layperson’s terms why predictive models aren’t causally interpretable

causality, econometrics, instrumental-variables, intuition, teaching

Imagine that you are asked to infer some causal effect: a change in an outcome $y$ in response to some variable $x$. But the person asking directs you to use a predictive model to do so. Here's the setup:

  • $x$ is confounded, in that some unobserved $u$ is causally linked to both $y$ and $x$. We have classical omitted-variable bias.
  • We have high-dimensional covariates $\mathbf{Z}$ that are not independent of $y$, $x$, or $u$.
  • You are asked to train a suite of predictive models (neural networks, boosted trees, whatever), denoted $y = g_i([x, \mathbf{Z}]) + \epsilon$, where $i$ indexes the models, and then to select the model $i$ that minimizes some metric of predictive skill, RMSE for instance.
  • Based on the chosen model, you are asked to report
    $$
    \frac{\partial \hat{y}}{\partial x} = \frac{\partial \hat{g}_i([x, \mathbf{Z}])}{\partial x}
    $$
  • You know that
    $$
    E\left[\frac{\partial \hat{y}}{\partial x}\right] \neq \frac{\partial y}{\partial x}
    $$

    in the population, because the error term absorbs the omitted variable, so that
    $$
    \frac{\partial \epsilon}{\partial x} \neq 0 \text{ in the population, despite the fact that } \frac{\partial \hat\epsilon}{\partial x} = 0
    $$

    for any reasonably fitted model $\hat{g}_i$.

On top of omitted-variable bias, there may be bias from regularization too!

  • Further assume that you have a causal model, say an instrumental-variables regression using a suitable instrument $w$ for $x$. It's one of the models in your suite, but its predictive skill, in terms of cross-validated RMSE, is worse than the others'.
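To make the tension concrete, here is a minimal simulation sketch of this setup (the linear data-generating process, the coefficient values, and plain least squares standing in for the flexible predictive model are all illustrative assumptions, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

u = rng.normal(size=n)                      # unobserved confounder
w = rng.normal(size=n)                      # instrument: shifts x, affects y only through x
Z = u[:, None] + rng.normal(size=(n, 5))    # observed covariates entangled with u
x = u + w + rng.normal(size=n)              # x is confounded by u
y = 2.0 * x + 3.0 * u + rng.normal(size=n)  # true causal effect of x on y is 2.0

# "Predictive" model: least squares of y on [x, Z]; its coefficient on x is the
# model's implied dy/dx.
X_pred = np.column_stack([np.ones(n), x, Z])
b_pred, *_ = np.linalg.lstsq(X_pred, y, rcond=None)
rmse_pred = np.sqrt(np.mean((y - X_pred @ b_pred) ** 2))

# "Causal" model: two-stage least squares, using w as the instrument for x.
X_w = np.column_stack([np.ones(n), w])
x_hat = X_w @ np.linalg.lstsq(X_w, x, rcond=None)[0]           # first stage
b_iv, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x_hat]), y, rcond=None)
rmse_iv = np.sqrt(np.mean((y - b_iv[0] - b_iv[1] * x) ** 2))   # predict with the actual x

print(f"predictive model: dy/dx = {b_pred[1]:.2f}, RMSE = {rmse_pred:.2f}")
print(f"IV model:         dy/dx = {b_iv[1]:.2f},  RMSE = {rmse_iv:.2f}")
# Typical result: the predictive model has the lower RMSE, but its dy/dx is
# biased upward; the IV slope sits near the true value of 2.0.
```

In runs like this, the predictive fit wins on RMSE because $\mathbf{Z}$ soaks up much of the variation coming from $u$, yet its implied $\partial \hat{y}/\partial x$ is biased; the IV fit predicts worse but recovers the causal slope.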

The best model is the one that produces the consistent causal estimate, right? But:

How would you explain this to someone in layperson's terms?

The person asking for the analysis doesn't understand causal inference and needs to be educated. However, they don't understand math and have a short attention span. How can you effectively convey the basic point that causal methods are required and predictive methods are inappropriate? No math, lots of stories, pithy sentences.

Best Answer

First of all, I don't think this should be treated as a strict dichotomy: "predictive models can never establish causal inference." There are various situations in which a predictive model gives us "pretty darn good" confidence that a given causal relationship exists. So what I'd say is that predictive models - no matter how sophisticated - are often insufficient to establish causality with a high degree of confidence. Now, how to explain this to people who don't know stats/math at all?

Here's one approach:

You've heard it said that "correlation is not causation." What that means is that just because two variables (call them A and B) are correlated, it doesn't mean one causes the other. This can happen when the correlation is due to a third "confounding" variable that is correlated with both A and B. For example: just because having a college degree is correlated with high earnings as an adult doesn't mean that getting a degree CAUSED those earnings to go up - it could be that "having rich parents" both allows people to get a degree and then separately helps them earn more (even if going to college actually does nothing).

Predictive models try to account for this problem by statistically "controlling for" confounding variables. So in the above case we could use statistical modeling to analyze the relationship between a degree and earnings after accounting for the fact that people with rich parents are more likely to have a degree.

Unfortunately, it's never possible in practice to control for EVERY confounding variable. This is partly because important variables (like the student's "personal motivation") may not exist in our data or may be impossible to measure. Even controlling for "parents being rich" is tricky - what single number could perfectly capture a family's entire economic situation? And how can we be sure that the data we have are accurate? Do any of us know PRECISELY how "rich" our parents were when we were growing up?
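(For the analyst rather than the layperson:) here is a small simulation sketch of that point, with made-up numbers, showing that controlling for a noisy measure of parental wealth shrinks the spurious degree "effect" but doesn't remove it, even when the degree truly does nothing:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
wealth = rng.normal(size=n)                      # true family wealth (never observed exactly)
wealth_reported = wealth + rng.normal(size=n)    # the imperfect number we can measure
degree = (wealth + rng.normal(size=n) > 0).astype(float)   # richer kids get degrees more often
earnings = 2.0 * wealth + rng.normal(size=n)     # the degree itself does NOTHING here

def degree_coef(controls):
    """OLS coefficient on `degree`, adjusting for whatever controls we include."""
    X = np.column_stack([np.ones(n), degree] + controls)
    return np.linalg.lstsq(X, earnings, rcond=None)[0][1]

print(f"no control:           {degree_coef([]):.2f}")                 # large and positive
print(f"noisy wealth control: {degree_coef([wealth_reported]):.2f}")  # smaller, still positive
print(f"true wealth control:  {degree_coef([wealth]):.2f}")           # ~0, the truth
```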

Another problem with predictive models is that, even if you COULD control for everything, they can't distinguish between A causing B and B causing A. So if we were trying to analyze the effect of depression on opiate use, no matter what control variables we include, we can't be sure that the effect we observe isn't just due to opiate use CAUSING depression. Note that this is probably NOT a problem for our earlier example, because it's impossible for your earnings as an adult to CAUSE you to have gone to college earlier in your life. This is one way in which our theoretical understanding of how these variables work helps us to understand the threats to causal inference.

The only way to completely ensure that a relationship between A and B is causal is to experimentally control how people get "assigned" to different values of A (e.g. to get a college degree or not). If assignment to A is completely random, then we can be sure that NOTHING else influenced A, which means you don't have to worry about ANY confounding variables (or even reverse causation from B) in analyzing the relationship between A and B. However, for reasons that are obvious when we're considering college degrees, random assignment is often infeasible or downright unethical. So we have to use other research-design approaches to approximate the causal power of random assignment. Critically, these other approaches (instrumental variables, regression discontinuity, natural experiments) rely on features of the world itself and of the data-collection process, rather than on statistical/mathematical adjustments alone, to address confounding.
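If you want to convince yourself (rather than the layperson) that random assignment really does cut through confounding, here is one more toy sketch with made-up numbers: the naive degree/earnings gap is badly inflated in the observational world, but the same naive comparison recovers the true effect when the degree is handed out by coin flip:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
rich = rng.binomial(1, 0.5, size=n)                        # the confounder

# Observational world: rich kids are far more likely to get a degree.
degree_obs = rng.binomial(1, 0.2 + 0.6 * rich)
earnings_obs = 10 + 5 * degree_obs + 20 * rich + rng.normal(0, 5, size=n)

# Experimental world: the degree is assigned by coin flip, independent of wealth.
degree_rct = rng.binomial(1, 0.5, size=n)
earnings_rct = 10 + 5 * degree_rct + 20 * rich + rng.normal(0, 5, size=n)

def naive_gap(earnings, degree):
    """Difference in mean earnings between the degree and no-degree groups."""
    return earnings[degree == 1].mean() - earnings[degree == 0].mean()

print(f"observational gap: {naive_gap(earnings_obs, degree_obs):.1f}")  # ~17: true effect + wealth
print(f"randomized gap:    {naive_gap(earnings_rct, degree_rct):.1f}")  # ~5: the true effect
```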