What variables need to be controlled for in this causal graphical model?

causal-diagram, causality, graphical-model

I have the graphical causal model below. I thought that when we apply the intervention (i.e. do-calculus), we get the graph on the right, that is, we delete the arrows going into the treatment (drug). To be clear, I want to see the effect of drug on cancer.

[Figure: the causal DAG and, on the right, the graph after intervening on drug]

However, when I used dagify in R and then ggdag_adjustment_set, it only highlighted age as something to control for:
[Figure: ggdag_adjustment_set output, with only age highlighted]

Why hasn't area been highlighted? Is it because controlling for area may lead to bias? What kind of variable is 'area', and do I need to control for it?

Best Answer

In order to isolate a causal effect, we need the causal effect to be "identifiable."

At a high level, assuming binary variables here, a causal effect is identifiable if we can express the treatment effect that we care about — in this case $P(Cancer(Drug = 1)) - P(Cancer(Drug = 0))$ — in terms of quantities computable from our observed data.

There are a few conditions that need to be satisfied for our causal effect to be identifiable, but since you're asking about "what should I control for," the one that is most relevant is exchangeability/conditional exchangeability. Formally, for your setting, we'd express this as $Cancer(Drug) \perp Drug \mid L$ — conditioned on some set of confounders $L$, there is no dependence between the counterfactual value of "Cancer" and the observed treatment "Drug."
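To make "in terms of quantities computable from our observed data" concrete, here is a sketch of the standard adjustment (standardization) formula for a discrete adjustment set $L$. It assumes that, in addition to conditional exchangeability, consistency and positivity also hold. Those companion conditions are an assumption here and are not spelled out in this answer.

$$
P(Cancer(Drug = d) = 1) = \sum_{l} P(Cancer = 1 \mid Drug = d, L = l)\, P(L = l)
$$

Every term on the right-hand side involves only observed variables, so the contrast between $d = 1$ and $d = 0$ can be estimated from data once $L$ has been chosen.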

"The hard part" is determining "what goes in $L$." Luckily, the "backdoor criterion" exists to determine which variables you need to control for in a given causal DAG in order to achieve (conditional) exchangeability. This criterion states that, given a causal DAG, you need to "block" all "paths" between treatment and outcome that aren't the treatment -> outcome arrow denoting the effect you're trying to estimate.

You can think of a path in a DAG as a chain of arrows (ignoring the direction for now). To block a path, there needs to be either a "collider" ($\rightarrow X \leftarrow$, where $X$ is some placeholder variable) that we are not conditioning on (+ one other condition that I'll omit for simplicity), or we need to condition on a non-collider ($\rightarrow X \rightarrow$ or $\leftarrow X \rightarrow$).

If you apply these conditions to your DAG, you'll see that, to achieve conditional exchangeability, we need to block the path $Drug \leftarrow Age \rightarrow Cancer$. Since $Age$ is a non-collider, we need to condition on it. We do not need to condition on $Area$, since it does not lie on a path between $Cancer$ and $Drug$. There may be settings or specific designs where you might condition on $Area$, but for identifying the causal effect of $Drug$ on $Cancer$, there is no need.
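The path-blocking argument can be checked mechanically. The sketch below is a hand-rolled illustration in plain Python, not what ggdag or dagitty do internally. Because the figure is not reproduced here, the edge list is an assumption (Age -> Drug, Age -> Cancer, Area -> Drug, Drug -> Cancer), and all function names are hypothetical helpers introduced for this example.

```python
# Assumed DAG (the original figure is not available): each edge is (parent, child).
EDGES = [("Age", "Drug"), ("Age", "Cancer"), ("Area", "Drug"), ("Drug", "Cancer")]


def descendants(node, edges):
    """Return every node reachable from `node` by following arrows forward."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def simple_paths(start, end, edges):
    """All simple paths from start to end, ignoring arrow direction."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)
    paths = []

    def walk(path):
        if path[-1] == end:
            paths.append(path)
            return
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                walk(path + [nxt])

    walk([start])
    return paths


def backdoor_paths(treatment, outcome, edges):
    """Paths whose first edge is an arrow pointing INTO the treatment."""
    edge_set = set(edges)
    return [p for p in simple_paths(treatment, outcome, edges)
            if (p[1], p[0]) in edge_set]


def is_blocked(path, conditioned, edges):
    """A path is blocked by an unconditioned collider or a conditioned non-collider."""
    edge_set = set(edges)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, mid) in edge_set and (nxt, mid) in edge_set
        if collider:
            # A collider blocks unless it, or one of its descendants, is conditioned on.
            if mid not in conditioned and not (descendants(mid, edges) & conditioned):
                return True
        elif mid in conditioned:
            return True
    return False


def satisfies_backdoor(treatment, outcome, conditioned, edges):
    """Backdoor criterion: no descendants of treatment, all backdoor paths blocked."""
    conditioned = set(conditioned)
    if conditioned & descendants(treatment, edges):
        return False
    return all(is_blocked(p, conditioned, edges)
               for p in backdoor_paths(treatment, outcome, edges))


if __name__ == "__main__":
    print(backdoor_paths("Drug", "Cancer", EDGES))
    for candidate in [set(), {"Area"}, {"Age"}, {"Age", "Area"}]:
        print(sorted(candidate), satisfies_backdoor("Drug", "Cancer", candidate, EDGES))
```

Under the assumed edges, the only backdoor path runs through Age, so a set is sufficient exactly when it contains Age. Adding Area to the set does not break the criterion in this graph, but it is not needed for identification either.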

Further reading

My summary of the backdoor criterion is derived from these lecture notes (slides 27-48), which give a further overview of "what do I condition on."

For further details, I'd recommend reading the first 3 chapters (approximately) of What If? — it's a fairly approachable textbook on causal inference.
