R – Conditional Independence Tests and D-Separation

Tags: bayesian-network, causality, d-separation, independence, r

First example (flawed): please refer to the corrected example below

I just tried to model a Bayesian network composed of 3 variables as follows

$A\sim N(0,1)$

$B\sim A + N(0,1)$

$T\sim A + B + N(0,1)$

In the DAG associated with this experiment, $A$ lies on a backdoor path from $B$ to $T$ ($B\leftarrow A\rightarrow T$), so I expect that conditioning on $A$ decreases the dependence between $B$ and $T$. However, in the simulated scenario (R code below) this does not seem to happen, since the p-value when testing $B\perp T|\emptyset$ is lower than the p-value of the test $B\perp T|A$.

Any idea of why this happens? May that be because the probability distribution of the variables is not faithful to the DAG, as pointed out in the answer here?

library(bnlearn)
set.seed(120395)

# Simulate the structural equations above
A = rnorm(n = 100, mean = 0, sd = sqrt(1))
B = A + rnorm(n = 100, mean = 0, sd = sqrt(1))
T = A + B + rnorm(n = 100, mean = 0, sd = sqrt(1))

df <- data.frame(A, B, T)

# Test B _||_ T (t1) and B _||_ T | A (t2) with Pearson correlation tests
t1 <- ci.test("B", "T", data = df, test = "cor")
t2 <- ci.test("B", "T", "A", data = df, test = "cor")
print(c(t1$p.value, t2$p.value))

Output:

6.70679e-37 1.66561e-20

Corrected example

Let us consider

$A\sim N(0,1)$

$B\sim A + N(0,1)$

$C\sim A + B + N(0,1)$

$D\sim A + B + C + N(0,1)$

$T\sim A + B + C + D + N(0,1)$

The DAG associated with this experiment is the following:

[Figure: DAG implied by the structural equations above, with edges $A\rightarrow B$; $A,B\rightarrow C$; $A,B,C\rightarrow D$; $A,B,C,D\rightarrow T$]

Let us study the association between $C$ and $T$. The d-separation criterion tells us that in this BN, without conditioning on any variable, all the paths from $C$ to $T$ are open, and I expect that, by blocking some of them, the dependence between $C$ and $T$ decreases.
In this particular graph, we expect the dependence obtained by conditioning on $\{A,D\}$ to be higher than the one obtained by conditioning on $\{A,B,D\}$, since the latter blocks the path $C\leftarrow B \rightarrow T$ in addition to the paths through $A$ and $D$. Putting this into formulas, we expect to see

$dep(C,T|\{A,D\}) > dep(C,T|\{A,B,D\})$

Using the negative p-value as a dependence measure (as suggested here), these expectations are violated, since the R code outputs

$dep(C,T|\{A,D\}) = -pvalue_{C\perp T|\{A,D\}} = -1.78\times 10^{-9}$

$dep(C,T|\{A,B,D\}) = -pvalue_{C\perp T|\{A,B,D\}} = -1.52\times 10^{-11}$

therefore

$dep(C,T|\{A,D\}) < dep(C,T|\{A,B,D\})$

Any idea why this happens? Could it be that p-values (and hence negative p-values) are not suited for dependence comparisons like the one I made?

Here's the code for this example

library(bnlearn)
set.seed(120395)

# Simulate the structural equations above
A = rnorm(n = 100, mean = 0, sd = sqrt(1))
B = A + rnorm(n = 100, mean = 0, sd = sqrt(1))
C = A + B + rnorm(n = 100, mean = 0, sd = sqrt(1))
D = A + B + C + rnorm(n = 100, mean = 0, sd = sqrt(1))
T = A + B + C + D + rnorm(n = 100, mean = 0, sd = sqrt(1))

df <- data.frame(A, B, C, D, T)

# Test C _||_ T given every conditioning set drawn from {A, B, D}
t1 <- ci.test("C", "T", data = df, test = "cor")
t2 <- ci.test("C", "T", "A", data = df, test = "cor")
t3 <- ci.test("C", "T", "B", data = df, test = "cor")
t4 <- ci.test("C", "T", "D", data = df, test = "cor")
t5 <- ci.test("C", "T", c("A","B"), data = df, test = "cor")
t6 <- ci.test("C", "T", c("A","D"), data = df, test = "cor")
t7 <- ci.test("C", "T", c("B","D"), data = df, test = "cor")
t8 <- ci.test("C", "T", c("A","B","D"), data = df, test = "cor")
print(c(t1$p.value, t2$p.value, t3$p.value, t4$p.value, 
        t5$p.value, t6$p.value, t7$p.value, t8$p.value))

Output:

[1] 5.008861e-67 2.379113e-42 2.425548e-32 6.708171e-09 2.204601e-25
[6] 1.783842e-09 1.329351e-09 1.521039e-11
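
For completeness, one can check with bnlearn's dsep() that no conditioning set d-separates $C$ and $T$ in this DAG, because of the direct edge $C\rightarrow T$; the model string below is just one way of hand-encoding the equations above, so this is only a sketch. The tests can therefore only show weaker, never vanishing, dependence:

library(bnlearn)

# DAG implied by the structural equations (hand-encoded model string)
dag <- model2network("[A][B|A][C|A:B][D|A:B:C][T|A:B:C:D]")

# The direct edge C -> T keeps C and T d-connected under any conditioning set
dsep(dag, "C", "T", c("A", "D"))        # FALSE
dsep(dag, "C", "T", c("A", "B", "D"))   # FALSE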

Best Answer

Any idea of why this happens?

You start your reasoning by stating that you would expect the statistical dependence between $B$ and $T$ given $A$ (a confounder) to be smaller than the marginal statistical dependence between $B$ and $T$, that is, $I(B;T) > I(B;T|A)$, where $I$ is the mutual information. Yes, you're right, it should be smaller. And it is. See the code below (based on the code you shared).

set.seed(120395)
A = rnorm(n = 100, mean = 0, sd = sqrt(1))
B = A + rnorm(n = 100, mean = 0, sd = sqrt(1))
T = A + B + rnorm(n = 100, mean = 0, sd = sqrt(1))

miic::discretizeMutual(B, T, plot = FALSE)$info
# 0.7031494
miic::discretizeMutual(B, T, matrix_u = matrix(A), plot = FALSE)$info
# 0.2519784

The issue here is that you are confusing the statistical significance of the independence test with the effect size of the dependence. Even with your independence test, if you inspect the objects t1 and t2 you will find a Pearson correlation of $0.89$ at first, and then $0.76$ after adjusting for the confounder. So your own simulation shows what you expected, which is indeed correct.
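
As a minimal sketch, assuming (as those numbers suggest) that ci.test with test = "cor" reports the (partial) correlation coefficient as its test statistic, the effect sizes can be read directly off the objects from the first code block:

# Effect sizes, not p-values (t1 and t2 from the first example's code)
unname(t1$statistic)   # about 0.89: marginal correlation between B and T
unname(t2$statistic)   # about 0.76: correlation between B and T after adjusting for A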

Regarding the p-values, if you inspect the objects returned by ci.test you will see that your p-values suggest there is no independence in either case. That is something I only noticed just now: even though you mention d-separation in the title of your question, there is no d-separation in your question. You are confusing two different graphs. You probably think that the DAG described by your three structural equations is the one below:

[Figure: DAG with edges $A\rightarrow B$ and $A\rightarrow T$, and no edge between $B$ and $T$]

That is, $B$ and $T$ are independent but are observed to be dependent due to the confounding effect of $A$. You do not need the $+\,B$ term in the equation for $T$ to get a spurious dependence; it is already there because both are caused by $A$. However, your structural equations lead to the causal diagram below:

[Figure: DAG with edges $A\rightarrow B$, $A\rightarrow T$, and $B\rightarrow T$]

That is, $B$ and $T$ are dependent. And although you can decrease the dependence between them by adjusting for $A$, a confounding factor, you cannot d-separate them, because they are directly dependent; quite the opposite, they are d-connected. That is why both p-values suggest a lack of independence. Moreover, the fact that the first p-value is smaller does not surprise me: without adjustment the dependence is even stronger, so there is even less evidence that they are independent. By adjusting for the confounder you decrease the dependence, which makes the data look less incompatible with independence, hence the larger p-value.
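
A short sketch with bnlearn's model2network() and dsep() (my own encoding of the two graphs) makes the contrast explicit:

library(bnlearn)

# Graph you seem to have in mind: A is a common cause, no B -> T edge
g_confounder <- model2network("[A][B|A][T|A]")
dsep(g_confounder, "B", "T", "A")   # TRUE: conditioning on A d-separates B and T

# Graph implied by your structural equations: B -> T is a direct edge
g_actual <- model2network("[A][B|A][T|A:B]")
dsep(g_actual, "B", "T", "A")       # FALSE: B and T remain d-connected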

May that be because the probability distribution of the variables is not faithful to the DAG, as pointed out in the answer here?

No, because you made it faithful. You could have created a set of structural equations and "hidden" part of it, or set up the relationships so that they create non-structural independencies (cancelling pathways; a toy illustration is sketched below). However, that is not what you did. You made it quite clear that $A$ causes $B$ and $T$, and that $B$ causes $T$. Cinelli wrote some comments about that here.
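
For illustration, here is a toy sketch (my own construction, not part of the original setup) of such a cancelling pathway: the direct effect of $A$ on $T$ is exactly offset by the indirect effect through $B$, so $A$ and $T$ are marginally independent even though the DAG contains the edge $A\rightarrow T$, which violates faithfulness:

library(bnlearn)
set.seed(1)

n <- 1000
A <- rnorm(n)
B <- A + rnorm(n)
# Direct effect of A on T (-1) cancels the indirect effect through B (+1),
# so the population correlation between A and T is exactly zero
T <- -A + B + rnorm(n)

df <- data.frame(A, B, T)
ci.test("A", "T", data = df, test = "cor")        # usually does not reject independence
ci.test("A", "T", "B", data = df, test = "cor")   # conditioning on B reveals the dependence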