Solved – If ‘correlation doesn’t imply causation’ and I find a statistically significant correlation, how can I prove causality?

causality, correlation, mathematical-statistics

I understand that correlation is not causation. Suppose we find a high correlation between two variables. How do we check whether this correlation is actually due to causation? Or, under what conditions, exactly, can we use experimental data to deduce a causal relationship between two or more variables?

Best Answer

A very likely reason for two variables being correlated is that their changes are both driven by a third variable. Other likely reasons are chance (if you test enough uncorrelated variables for correlation, some will show correlation) and very complex mechanisms that involve multiple steps.
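The "chance" mechanism is easy to demonstrate: if you naively screen many pairs of completely independent variables, roughly 5% of the pairs will clear a 5%-level significance threshold anyway. A minimal sketch (all numbers here are arbitrary choices for the simulation, not from any real dataset):

```python
# Sketch: with enough unrelated variables, some pairs will look
# "significantly" correlated purely by chance (multiple comparisons).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n_vars, n_obs = 40, 30
data = rng.normal(size=(n_vars, n_obs))  # 40 mutually independent variables

# For n = 30 observations, a Pearson |r| above roughly 0.361 would be
# declared "significant" by a naive two-sided test at the 5% level.
critical_r = 0.361
spurious = sum(
    1 for i, j in combinations(range(n_vars), 2)
    if abs(np.corrcoef(data[i], data[j])[0, 1]) > critical_r
)
n_pairs = n_vars * (n_vars - 1) // 2  # 780 pairs in total
print(f"{spurious} of {n_pairs} independent pairs look 'significant'")
```

Since every variable is generated independently, every one of those "significant" correlations is spurious; you should expect on the order of 5% of the 780 pairs to appear.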

See http://tylervigen.com/ for a collection of such spurious correlations.

To confidently state that A causes B, you need an experiment in which you can control variable A without influencing the other variables. Then you measure whether the correlation between A and B persists when you change A.
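The point of controlling A yourself is that it cuts any link from hidden common causes to A. A small simulation can make this concrete (the variable names and effect sizes below are invented for illustration): in the observational world a hidden Z drives both A and B, so they correlate; in the experimental world the experimenter sets A, and the correlation vanishes.

```python
# Sketch: observational data with a hidden confounder Z driving both A and B,
# versus an "experiment" where the experimenter sets A (an intervention).
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
z = rng.normal(size=n)  # hidden common cause

# Observational world: A and B both follow Z, so they correlate strongly
a_obs = z + rng.normal(scale=0.5, size=n)
b_obs = z + rng.normal(scale=0.5, size=n)
r_obs = np.corrcoef(a_obs, b_obs)[0, 1]

# Experimental world: we control A directly, cutting the Z -> A link;
# B still depends only on Z, so the A-B correlation disappears
a_exp = rng.normal(size=n)  # A set by the experimenter, independent of Z
b_exp = z + rng.normal(scale=0.5, size=n)
r_exp = np.corrcoef(a_exp, b_exp)[0, 1]

print(f"observational r = {r_obs:.2f}, experimental r = {r_exp:.2f}")
```

If intervening on A had left the correlation intact, that would have been evidence for A actually causing B; here it vanishes, revealing the confounder.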

For nearly all practical applications it is almost impossible to change A without also influencing other (often unknown) variables, so the best we can do is attempt to disprove a hypothesized causation.

To state a causal relationship, you start with the hypothesis that two variables have a causal relationship, design an experiment that could disprove that hypothesis, and if the experiment fails to disprove it, you can state with some degree of certainty that the hypothesis is true. How high that degree of certainty needs to be depends on your field of research.

In many fields it is common or even required to run two arms of the experiment in parallel: one where variable A is changed, and a control group where A is not changed but the experiment is otherwise exactly the same - e.g. in medicine you still stick subjects with a needle or make them swallow (placebo) pills. If the experiment shows correlation between A and B, but not between A and B' (the B of the control group), you can assume causation.
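The two-arm design above can be sketched as a simulation. The scenario (a drug lowering blood pressure, with the baseline, effect size, and noise level all made up for illustration) just shows the comparison you would run between the treatment arm and the control arm:

```python
# Sketch (hypothetical numbers): a treatment arm where A is changed and a
# control arm that is otherwise identical. A difference in B that appears
# only in the treatment arm supports the causal hypothesis.
import numpy as np

rng = np.random.default_rng(2)
n = 200                  # subjects per arm
baseline = 140.0         # hypothetical blood pressure without treatment
effect = -8.0            # true causal effect of the drug (unknown in practice)

treated = baseline + effect + rng.normal(scale=10, size=n)
control = baseline + rng.normal(scale=10, size=n)  # placebo: same protocol

# Welch's t statistic for the difference in means between the arms
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / n)
t_stat = diff / se
print(f"mean difference = {diff:.1f}, t = {t_stat:.1f}")
```

A large |t| lets you reject the null hypothesis of no effect; because the only systematic difference between the arms is A, the remaining explanation for the shift in B is the treatment itself.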

There are also other ways to conclude causality when an experiment is either not possible or inadvisable for various reasons (morals, ethics, PR, cost, time).

One common way is deduction. Taking an example from a comment: to show that smoking causes cancer in humans, we can use an experiment to prove that smoking causes cancer in mice, then show that there is a correlation between smoking and cancer in humans, and deduce that it is therefore extremely likely that smoking causes cancer in humans. This argument can be strengthened if we also disprove the reverse direction, that cancer causes smoking.

Another way is the exclusion of other causes of the correlation, leaving causality as the best remaining explanation. This method is not always applicable, because it is sometimes impossible to eliminate all possible alternative causes of the correlation (called "back-door paths" in another answer). In the smoking/cancer example, we could probably use this approach to show that smoking is responsible for tar in the lungs, because there are not that many possible sources for that.

These other ways of "proving" causality are not ideal from a scientific point of view, because they are less conclusive than a direct experiment. The global warming debate is a good example of how much easier it is to dismiss a causal claim that has not yet been established conclusively with a repeatable experiment.

For comic relief, here's an example of an experiment that would be technically feasible, but is not advisable for non-scientific reasons (morals, ethics, PR, cost):

Image taken from phroyd.tumblr.com