Solved – Misunderstandings of “spurious correlation”

I've heard people use the term spurious correlation in so many different contexts and in so many different ways that I'm getting confused. Moreover, the Wikipedia page for Spurious relationship states:

“In statistics, a spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are
not causally related to each other (i.e. they are independent),
yet it may be wrongly inferred that they are, due to either
coincidence or the presence of a certain third, unseen factor”

A couple of observations:

  1. Mathematically speaking, two variables $A, B$ are independent $\iff P(A | B) = P(A)$, correct?

    Clearly, if two variables are correlated, then even if the dependency is driven by some third factor, the two are still not independent, contrary to what the Wikipedia article claims. What's up with that?

  2. If the “spurious” correlation is statistically significant (or not a result of coincidence), then what's wrong with that? I've seen people jumping out like rabid animals, foam coming out of their mouth screaming: “Spurious! Spurious!”.

    I don't understand why they do it — no one is claiming that there is a causal link between the variables. Correlation can exist without causation, so why label it “spurious”, which is sort of equivalent to calling it “fake”?

Best Answer

I've always hated the term "spurious correlation" because it is not the correlation that is spurious, but the inference of an underlying (false) causal relationship. So-called "spurious correlation" arises when there is evidence of correlation between variables, but the correlation does not reflect a causal effect from one variable to the other. If it were up to me, this would be called "spurious inference of cause", which is how I think of it. So you're right: people shouldn't foam at the mouth over the mere fact that statistical tests can detect correlation, especially if there is no assertion of an underlying cause. (Unfortunately, just as people often confuse correlation and cause, some people also mistake an assertion of correlation for an implicit assertion of cause, and then object to this as spurious!)

To follow explanations of this topic and avoid interpretive errors, you also have to bear in mind the difference between statistical independence and causal independence. In the Wikipedia quote in your question, they are (implicitly) referring to causal independence, not statistical independence (the latter is the one where $\mathbb{P}(A|B) = \mathbb{P}(A)$). The Wikipedia explanation could be tightened up by being more explicit about the difference, but it is worth reading it in a way that allows for the two meanings of "independence".
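
A small simulation may help make the distinction concrete. The sketch below is only an illustration under assumptions of my own: a hidden variable `z` drives both `x` and `y` through a simple additive Gaussian model, and neither `x` nor `y` has any effect on the other. The names `pearson` and `partial_corr`, the seed, and the sample size are all hypothetical choices, not anything taken from the Wikipedia article.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
n = 20_000              # assumed sample size

# Assumed causal structure: Z -> X and Z -> Y, with no arrow between X and Y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]  # X depends on Z plus its own noise
y = [zi + rng.gauss(0, 1) for zi in z]  # Y depends on Z plus its own noise

raw_r = pearson(x, y)
partial_r = partial_corr(raw_r, pearson(x, z), pearson(y, z))

print(f"corr(X, Y)     = {raw_r:.3f}")      # marginal association
print(f"corr(X, Y | Z) = {partial_r:.3f}")  # association after controlling for Z
```

Under this model the population correlation between `x` and `y` is 0.5, so the two are statistically dependent and the correlation itself is perfectly genuine. Once `z` is controlled for, the partial correlation should sit close to zero, which reflects the fact that `x` and `y` are causally independent. The only thing that would be spurious here is the inference that one causes the other.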
