I've always hated the term "spurious correlation" because it is not the correlation that is spurious, but the inference of an underlying (false) causal relationship. So-called "spurious correlation" arises when there is evidence of correlation between variables, but the correlation does not reflect a causal effect from one variable to the other. If it were up to me, this would be called "spurious inference of cause", which is how I think of it. So you're right: people shouldn't foam at the mouth over the mere fact that statistical tests can detect correlation, especially if there is no assertion of an underlying cause. (Unfortunately, just as people often confuse correlation and cause, some people also confuse the assertion of correlation as an implicit assertion of cause, and then object to this as spurious!)
To understand explanations of this topic and avoid interpretive errors, bear in mind the difference between statistical independence and causal independence. In the Wikipedia quote in your question, they are (implicitly) referring to causal independence, not statistical independence (the latter being the one where $\mathbb{P}(A|B) = \mathbb{P}(A)$). The Wikipedia explanation could be tightened up by being more explicit about this difference, but it is worth interpreting it in a way that allows for the dual meanings of "independence".
At the interventional level, there is no contradiction between the factual world and the action of interest. For example, smoking until today and being forced to quit smoking starting tomorrow are not in contradiction with each other, even though you could say one “negates” the other. But now imagine the following scenario. You know Joe, a lifetime smoker who has lung cancer, and you wonder: had Joe not smoked for thirty years, would he be healthy today? In this case we are dealing with the same person, at the same time, imagining a scenario where action and outcome are in direct contradiction with known facts.
Thus, the main difference between interventions and counterfactuals is that, whereas in interventions you are asking what will happen on average if you perform an action, in counterfactuals you are asking what would have happened had you taken a different course of action in a specific situation, given that you have information about what actually happened. Note that, since you already know what happened in the actual world, you need to update your information about the past in light of the evidence you have observed.
These two types of queries are mathematically distinct because they require different levels of information to be answered (counterfactuals need more information) and even a more elaborate language to be articulated.
With the information needed to answer Rung 3 questions you can answer Rung 2 questions, but not the other way around. More precisely, you cannot answer counterfactual questions with just interventional information. Examples where interventions and counterfactuals clash have already been given here on CV; see this post and this post. However, for the sake of completeness, I will include an example here as well.
The example below can be found in Causality, section 1.4.4.
Consider that you have performed a randomized experiment where patients were randomly assigned (50% / 50%) to treatment ($x =1$) and control conditions ($x=0$), and in both treatment and control groups 50% recovered ($y=0$) and 50% died ($y=1$). That is $P(y|x) = 0.5~~~\forall x,y$.
The result of the experiment tells you that the average causal effect of the intervention is zero. This is a Rung 2 question: $P(Y = 1|do(X = 1)) - P(Y=1|do(X=0)) = 0$.
But now let us ask the following question: what percentage of those patients who died under treatment would have recovered had they not taken the treatment? Mathematically, you want to compute $P(Y_{0} = 0|X =1, Y = 1)$.
This question cannot be answered just with the interventional data you have. The proof is simple: I can create two different causal models that will have the same interventional distributions, yet different counterfactual distributions. The two are provided below:
![Two causal models with identical interventional distributions but different counterfactual distributions](https://i.stack.imgur.com/LH6xf.png)
Here, $U$ amounts to unobserved factors that explain how the patient reacts to the treatment. You can think of factors that explain treatment heterogeneity, for instance. Note that the marginal distribution $P(y, x)$ of both models agrees.
Note that, in the first model, no one is affected by the treatment, thus the percentage of those patients who died under treatment that would have recovered had they not taken the treatment is zero.
However, in the second model, every patient is affected by the treatment, and we have a mixture of two populations in which the average causal effect turns out to be zero. In this example, the counterfactual quantity goes to 100%: in Model 2, all patients who died under treatment would have recovered had they not taken the treatment.
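To make this concrete, here is a minimal simulation sketch of the two models. The functional forms ($Y = U$ in Model 1, $Y = X \oplus U$ in Model 2, with $U \sim \text{Bernoulli}(0.5)$) are my reading of the figure, not a quote from the book:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 2, n)   # randomized treatment assignment (50/50)
u = rng.integers(0, 2, n)   # unobserved factor U ~ Bernoulli(0.5)

# Structural equations for the outcome in each model (my assumed forms).
f1 = lambda x, u: u          # Model 1: the treatment affects no one
f2 = lambda x, u: x ^ u      # Model 2: the treatment flips everyone's outcome

for name, f in [("Model 1", f1), ("Model 2", f2)]:
    y = f(x, u)
    # Rung 2: the interventional quantity, identical across the two models.
    ace = y[x == 1].mean() - y[x == 0].mean()
    # Rung 3: among those who died under treatment (X=1, Y=1), the fraction
    # that would have recovered without it. Abduction recovers U exactly in
    # these deterministic models; we then predict Y under do(X=0).
    died_treated = (x == 1) & (y == 1)
    y0 = f(0, u[died_treated])
    print(name, "ACE =", round(ace, 3),
          "| P(Y_0=0 | X=1, Y=1) =", round((y0 == 0).mean(), 3))
```

Running this prints an average causal effect of roughly zero for both models, while the counterfactual recovery fraction is 0 in Model 1 and 1 in Model 2, exactly the divergence described above.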
Thus, there is a clear distinction between Rung 2 and Rung 3. As the example shows, you cannot answer counterfactual questions with just information and assumptions about interventions. This is made clear by the three steps for computing a counterfactual:
- Step 1 (abduction): update the probability of the unobserved factors, $P(u)$, in light of the observed evidence $e$, obtaining $P(u|e)$.
- Step 2 (action): perform the action in the model (for instance, $do(x)$).
- Step 3 (prediction): predict $Y$ in the modified model.
This cannot be computed without some functional information about the causal model, or without some information about the latent variables.
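Applied to Model 2 above (again using my assumed functional form $y = x \oplus u$), the three steps read: abduction gives $u = y \oplus x = 1 \oplus 1 = 0$, so $P(u = 0 \mid X = 1, Y = 1) = 1$; action sets $do(X = 0)$; prediction then yields $Y_{0} = 0 \oplus 0 = 0$, i.e. $P(Y_{0} = 0 \mid X = 1, Y = 1) = 1$, the 100% figure from the example.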
Best Answer
No.
With the caveat that the direct causal relationships embedded in a DAG are beliefs (or at least presuppositions of belief), so that the counterfactual formal causal analysis one performs is predicated on the DAG being true, your question gets at the utility of this kind of reasoning: in this worldview, correlations can only be interpreted causally given the d-separation of the path from one variable to another. If a set of variables (say, $L$) is sufficient to d-separate the path from $A$ to $Y$ (say, $Y$ as putative effect, and $A$ as putative cause of $Y$), then:

$$\text{cor}(Y, A \mid L) = 0.$$
That is the point of this kind of causal analysis. (And it is also why it offers value, by directing critique of an analysis specifically to the construction of $L$ and the DAG.)
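As an illustration of how such a claim can be checked mechanically, here is a sketch using networkx (assuming version ≥ 3.3, which provides `nx.is_d_separator`); the graph is a toy DAG of my own, with $L$ a common cause of $A$ and $Y$ and no causal path from $A$ to $Y$:

```python
import networkx as nx  # assumes networkx >= 3.3 for nx.is_d_separator

# Toy DAG: A <- L -> Y. L is a common cause; A has no causal path to Y.
G = nx.DiGraph([("L", "A"), ("L", "Y")])

print(nx.is_d_separator(G, {"A"}, {"Y"}, set()))   # False: backdoor via L is open
print(nx.is_d_separator(G, {"A"}, {"Y"}, {"L"}))   # True: conditioning on L blocks it
```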
Except, kinda yes (but still no).
Back to the caveat about DAGs embodying beliefs. Those beliefs may be more or less valid for any given analysis. In fact, the DAG you provide indicates a good reason why: most variables we might imagine (whether fitting into $L$, $Y$, or $A$ in my nomenclature above) are themselves caused by some other variable… likely a variable in the set of unmeasured prior causes $U$. This is why the validity of causal inferences from observational studies is always subject to threats from unmeasured backdoor confounding (this quality is part of what we mean by 'observational study'), and why randomized controlled trials have a special kind of value (even though causal inferences from randomized controlled trials are just as subject to threats from selection bias as observational study designs).
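A quick sketch (the data-generating process here is invented purely for illustration) of why randomization has this special value: an unmeasured $U$ that drives both treatment uptake and the outcome biases the observational contrast, while random assignment cuts the backdoor path:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
u = rng.normal(size=n)                 # unmeasured confounder U

# Observational world: U drives both treatment uptake and the outcome,
# while the true causal effect of treatment on the outcome is zero.
a_obs = (u + rng.normal(size=n) > 0).astype(int)
y_obs = u + rng.normal(size=n)
print("observational 'effect':",
      y_obs[a_obs == 1].mean() - y_obs[a_obs == 0].mean())   # biased, ~1.1

# Randomized world: assignment ignores U, so the backdoor is cut.
a_rct = rng.integers(0, 2, n)
y_rct = u + rng.normal(size=n)
print("randomized effect:",
      y_rct[a_rct == 1].mean() - y_rct[a_rct == 0].mean())   # ~0
```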
Many great examples of correlations existing between 'causally unrelated' variables and processes are provided in links in comments to Mir Henglin's question. I would argue that rather than falsifying my unqualified "No." at the start of my answer, these indicate merely that the DAG has not actually been expanded to cover all the causal variables at play: the set of causal beliefs is incomplete (for example, see Pearl's point about incorporating hidden variables into the DAG). @whuber also made an important comment along these lines:
There are competing interpretations of the appropriateness of time as a causal variable in counterfactual formal causal reasoning. I will point out that:
So there is a case to be made that lengths of time can serve as a confounding variable in counterfactual formal causal reasoning.
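To see how a shared time trend can manufacture correlation between causally unrelated series, here is a small sketch (the series and coefficients are invented for illustration); conditioning on time, here via detrending, removes the correlation:

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(200)                      # time, the hidden common cause

# Two causally unrelated series that both trend upward with time.
a = 0.05 * t + rng.normal(0, 1, t.size)
y = 0.03 * t + rng.normal(0, 1, t.size)

print("raw cor(A, Y):", np.corrcoef(a, y)[0, 1])            # large, "spurious"

# Condition on time by regressing out a linear trend from each series,
# then correlate the residuals (a partial correlation given t).
resid = lambda v: v - np.polyval(np.polyfit(t, v, 1), t)
print("cor(A, Y | t):", np.corrcoef(resid(a), resid(y))[0, 1])  # ~0
```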
The upshot is to repeat my opening caveat: conditional on a DAG being true, if the path from $A$ to $Y$ is d-separated and $\text{cor}(Y, A \mid L) = 0$, then $A$ cannot cause $Y$.