How to estimate causality, or the causal effect, between two variables using statistical techniques in Python

causality, inference, machine-learning, mathematical-statistics, python

I am new to the idea of causal inference or causality in statistics and in Python.

I have a dataframe test which looks as follows:

x   y
0   0.03    315.98
1   -0.03   316.91
2   0.06    317.64
3   0.03    318.45
4   0.05    318.99
... ... ...
58  0.92    406.76
59  0.84    408.72
60  0.97    411.66
61  1.01    414.24
62  0.84    416.45

test.to_dict() is given as:

{'x': {0: 0.03,
  1: -0.03,
  2: 0.06,
  3: 0.03,
  4: 0.05,
  5: -0.2,
  6: -0.11,
  7: -0.06,
  8: -0.02,
  9: -0.08,
  10: 0.05,
  11: 0.02,
  12: -0.08,
  13: 0.01,
  14: 0.16,
  15: -0.07,
  16: -0.01,
  17: -0.1,
  18: 0.18,
  19: 0.07,
  20: 0.16,
  21: 0.26,
  22: 0.32,
  23: 0.14,
  24: 0.31,
  25: 0.16,
  26: 0.12,
  27: 0.18,
  28: 0.32,
  29: 0.39,
  30: 0.27,
  31: 0.45,
  32: 0.4,
  33: 0.22,
  34: 0.23,
  35: 0.31,
  36: 0.44,
  37: 0.33,
  38: 0.46,
  39: 0.61,
  40: 0.38,
  41: 0.39,
  42: 0.53,
  43: 0.62,
  44: 0.62,
  45: 0.53,
  46: 0.67,
  47: 0.63,
  48: 0.66,
  49: 0.54,
  50: 0.65,
  51: 0.72,
  52: 0.61,
  53: 0.64,
  54: 0.67,
  55: 0.74,
  56: 0.89,
  57: 1.01,
  58: 0.92,
  59: 0.84,
  60: 0.97,
  61: 1.01,
  62: 0.84},
 'y': {0: 315.98,
  1: 316.91,
  2: 317.64,
  3: 318.45,
  4: 318.99,
  5: 319.62,
  6: 320.04,
  7: 321.37,
  8: 322.18,
  9: 323.05,
  10: 324.62,
  11: 325.68,
  12: 326.32,
  13: 327.46,
  14: 329.68,
  15: 330.19,
  16: 331.12,
  17: 332.03,
  18: 333.84,
  19: 335.41,
  20: 336.84,
  21: 338.76,
  22: 340.12,
  23: 341.48,
  24: 343.15,
  25: 344.85,
  26: 346.35,
  27: 347.61,
  28: 349.31,
  29: 351.69,
  30: 353.2,
  31: 354.45,
  32: 355.7,
  33: 356.54,
  34: 357.21,
  35: 358.96,
  36: 360.97,
  37: 362.74,
  38: 363.88,
  39: 366.84,
  40: 368.54,
  41: 369.71,
  42: 371.32,
  43: 373.45,
  44: 375.98,
  45: 377.7,
  46: 379.98,
  47: 382.09,
  48: 384.02,
  49: 385.83,
  50: 387.64,
  51: 390.1,
  52: 391.85,
  53: 394.06,
  54: 396.74,
  55: 398.81,
  56: 401.01,
  57: 404.41,
  58: 406.76,
  59: 408.72,
  60: 411.66,
  61: 414.24,
  62: 416.45}}

There are two variables in this dataframe, x and y. x is the independent variable, and y is the dependent variable.

I can calculate the correlation between the two using:

test.corr()

It returned:

x   y
x   1.000000    0.961354
y   0.961354    1.000000

This shows that the correlation coefficient between x and y is about 0.96, i.e. they are very strongly linearly associated. However, correlation alone does not establish a causal relationship between the two variables.
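To quantify the strength of that association (and nothing more than association), a simple linear regression gives both the correlation and a slope. This is a minimal sketch; it rebuilds a small stand-in dataframe from the first and last rows of the data above so it runs on its own:

```python
import pandas as pd
from scipy.stats import linregress

# Stand-in for the `test` dataframe: first 5 and last 5 rows of the data.
test = pd.DataFrame({
    "x": [0.03, -0.03, 0.06, 0.03, 0.05, 0.92, 0.84, 0.97, 1.01, 0.84],
    "y": [315.98, 316.91, 317.64, 318.45, 318.99,
          406.76, 408.72, 411.66, 414.24, 416.45],
})

result = linregress(test["x"], test["y"])
print(f"Pearson r:          {result.rvalue:.3f}")
print(f"OLS slope (y on x): {result.slope:.2f}")
# The slope measures the linear association between x and y;
# it is NOT a causal effect without further assumptions.
```

Note that the regression is direction-symmetric in an important sense: you could just as well regress x on y and get an equally good fit, which is exactly the problem the answer below addresses.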

How can I show statistically in Python that x causes y, and quantify that effect with a specific value?

Best Answer

As Christoph and Cryo have mentioned, you are asking the impossible unless you have more information. Christoph is absolutely correct in saying that you would need to have run an experiment to get the data you have, or for some other reason, you would need to be confident that you have no confounding variables.

Formally, Theorem 1.2.8 (Observational Equivalence) on page 19 of Pearl's Causality: Models, Reasoning, and Inference, 2nd Ed., states the following:

Two DAGs are observationally equivalent if and only if they have the same skeletons and the same sets of $v$-structures, that is, two converging arrows whose tails are not connected by an arrow.

The skeleton refers to the nodes and undirected edges: two graphs have the same skeleton if, after crunching all their directed edges down to undirected edges, you end up with the same graph. A $v$-structure is a collider $X \to Z \leftarrow Y$ in which the two parents $X$ and $Y$ are not themselves adjacent.

In your case, you have only mentioned two variables, so you can't even have $v$-structures. It follows from the theorem, then, that you cannot use any data to distinguish between the two graphs $X\to Y$ and $Y\to X.$
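This observational equivalence can be made concrete with a small simulation (my own illustration, not from Pearl): a model where X causes Y and a model where Y causes X, with parameters chosen so both imply the same joint Gaussian distribution, produce data that no statistic can tell apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Model A: X -> Y, with Y = 2X + noise.
x_a = rng.normal(0.0, 1.0, n)
y_a = 2.0 * x_a + rng.normal(0.0, 1.0, n)
# Implied: Var(X) = 1, Var(Y) = 4 + 1 = 5, Cov(X, Y) = 2.

# Model B: Y -> X, with parameters chosen to match Model A's
# joint distribution exactly: X = (2/5)Y + noise.
y_b = rng.normal(0.0, np.sqrt(5.0), n)
x_b = (2.0 / 5.0) * y_b + rng.normal(0.0, np.sqrt(1.0 / 5.0), n)
# Implied: Var(X) = 0.16*5 + 0.2 = 1, Cov(X, Y) = 2 -- same as Model A.

c_a = np.corrcoef(x_a, y_a)[0, 1]
c_b = np.corrcoef(x_b, y_b)[0, 1]
print(f"corr under X -> Y: {c_a:.3f}")
print(f"corr under Y -> X: {c_b:.3f}")
# Both are ~2/sqrt(5) ~= 0.894: the data cannot reveal the arrow's direction.
```

The same holds for any other observational statistic, not just the correlation, since the two models define the same joint distribution.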

There are algorithms to detect causal models, but they are subject to the fundamental limitation of this theorem. Pearl writes:

Observational equivalence places a limit on our ability to infer directionality from probabilities alone. Two networks that are observationally equivalent cannot be distinguished without resorting to manipulative experimentation or temporal information.