Causality – Addressing Stable Violation of Faithfulness

bayesian-network · causality · structural-equation-modeling

Faithfulness is often justified by the argument that any violation of it requires very specific, "fine-tuned" parameters (for some appropriate SCMs/SEMs/SFMs), and that such violations are necessarily "unstable". That is, it is argued that parameters leading to violations of faithfulness are rare (relative to the set of all possible system/model parameters) and so unlikely to be encountered in practice; furthermore, even a slight change to any of the parameters that bring about a violation may upend the "fine-tuning" and restore faithfulness, which makes the assumption of faithfulness reasonable to hold in practice. Some classic examples of faithfulness violations do in fact demonstrate these points, and may be used to support the presumption of faithfulness. Consider:

  1. Path cancellation: Consider a linear SCM with endogenous variables $V=\{A,B,C\}$, and with the structural equations $A := U_{1}$, $B := \alpha A$, and $C := \gamma A + \beta B + U_{2}$, where $U_{1}$ and $U_{2}$ are independent "exogenous" noise terms. Now, by substitution, we derive that:
    $$C := \gamma U_{1} + \beta \alpha U_{1} + U_{2} = (\gamma + \beta\alpha)\,U_{1} + U_{2}$$
    Hence, if $\gamma = -\beta \alpha$ holds precisely, $U_{1}$ disappears from the RHS, and $C$ and $A$ become unconditionally independent. This holds even though there are two open paths between them (the influences of the two paths exactly cancel each other), so we have a failure of statistical independence to imply d-separation (and thus a violation of faithfulness). Yet, over many reasonable spaces of possible values for $\{\alpha,\beta,\gamma\}$, the set of parameter assignments satisfying the necessary equality $\gamma = -\beta \alpha$ has measure zero. Furthermore, even in such cases, an arbitrarily small perturbation to a single parameter from $\{\alpha,\beta,\gamma\}$ can always restore faithfulness. Hence, if anything, this theoretical example shows how "fickle" and rare a violation of faithfulness should be (in the context of linear systems).

  2. Functional independence: Consider another SCM with three endogenous variables $V=\{A,B,C\}$ and two exogenous noise terms $U=\{U_{1},U_{2}\}$. Let the noise terms be i.i.d. Bernoulli with $p=0.5$ (i.e. independent fair coin tosses). Also, let $A := U_{1}$, $B := U_{2}$, and $C := \text{XOR}(A,B)$. It can easily be shown that the pairs $A$ and $C$, and $B$ and $C$, are unconditionally independent, even though they are clearly d-connected, yielding another violation of faithfulness. However, we can also easily show that any arbitrarily small deviation from the parameter value $p=0.5$ for $U_{1}$ and $U_{2}$ breaks the "fine-tuning" and restores faithfulness… (a simulation sketch of both examples follows right after this list).
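Both examples are easy to check numerically. Below is a minimal simulation sketch (Python/NumPy; the parameter values $\alpha=2$, $\beta=3$, the perturbation sizes, and the sample size are illustrative choices of mine): under the exact "fine-tuned" parameters the dependence vanishes, and under a small perturbation it reappears.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# --- Example 1: path cancellation (illustrative choice alpha=2, beta=3) ---
alpha, beta = 2.0, 3.0
U1, U2 = rng.normal(size=n), rng.normal(size=n)
A = U1
B = alpha * A
for gamma in (-beta * alpha, -beta * alpha + 0.1):  # exact cancellation vs. small perturbation
    C = gamma * A + beta * B + U2
    print(f"gamma = {gamma:+.2f}:  corr(A, C) = {np.corrcoef(A, C)[0, 1]:+.3f}")

# --- Example 2: XOR, fair coins vs. slightly biased coins ---
for p in (0.5, 0.55):
    A2 = (rng.random(n) < p).astype(int)
    B2 = (rng.random(n) < p).astype(int)
    C2 = A2 ^ B2
    print(f"p = {p:.2f}:  corr(A, C) = {np.corrcoef(A2, C2)[0, 1]:+.3f}")
```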

Now, I am aware that there are plenty of "applied" criticisms of faithfulness w.r.t. real-world counterexamples. E.g. for biological systems, it is argued that violations of faithfulness seem to be common if not abundant (e.g. consider homeostatic processes). However, I seem to have come across a category of very simple "stable" faithfulness violations that I have not seen in the literature, and which can be viewed as a more theoretical argument against faithfulness… Consider the following DAG:

[Figure: DAG with $A$ as the common parent of $B$, $C$, and $D$.]

This is the DAG of the following SCM: $V=\{A,B,C,D\}$ with $U=\{U_{1},U_{2},U_{3},U_{4}\}$, where $U_{1}$ is Bernoulli with $p=0.5$, and $U_{2},U_{3},U_{4}$ are i.i.d. uniform RVs over some small interval $[-d,d]$ (with $d \ll 0.5$). Let $A := U_{1}$, $B := U_{2}+A$, $C := U_{3}+A$, and $D := U_{4}+A$. This clearly yields a DAG with $A$ as the root node of the three forks/divergent paths connecting $B$, $C$ and $D$. Now, the rules of d-separation say nothing about conditioning on a child of the middle node of a fork; presumably, any path containing such a fork remains open (provided, of course, that the child variable we condition on is not itself a node of the path). So conditioning on, say, $B$ should not d-separate $C$ and $D$ (should not make them conditionally independent). However, because observing $B$ lets us infer the value of $A$ exactly (since $d<0.5$, we have $A=0$ whenever $B<0.5$ and $A=1$ whenever $B>0.5$), observing $B$ does in fact render $C$ and $D$ conditionally independent. By symmetry, the same holds for $C$ w.r.t. $B$ and $D$, and for $D$ w.r.t. $B$ and $C$. Since we have a conditional independence without a corresponding d-separation, we have obtained a violation of faithfulness.

Importantly, the type of violation we have found is stable: it holds for any value of $p \in (0,1)$ (the success probability of the Bernoulli $U_1$) and for any value of $d$ bounding the supports of $U_2,U_3,U_4$, as long as $d<0.5$ holds.
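Here is a minimal simulation sketch of this SCM (Python/NumPy; the $(p, d)$ values and sample size are illustrative, and conditioning on $B$ is approximated by averaging the correlation of $C$ and $D$ within narrow bins of $B$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def corr_CD_given_B(p, d, n_bins=200):
    """Simulate the fork SCM and compare corr(C, D) with corr(C, D | B)."""
    A = (rng.random(n) < p).astype(float)   # A := U1, Bernoulli(p)
    B = A + rng.uniform(-d, d, n)           # B := A + U2
    C = A + rng.uniform(-d, d, n)           # C := A + U3
    D = A + rng.uniform(-d, d, n)           # D := A + U4
    marginal = np.corrcoef(C, D)[0, 1]
    # approximate conditioning on B via narrow fixed-width bins
    edges = np.linspace(B.min(), B.max() + 1e-9, n_bins + 1)
    idx = np.digitize(B, edges)
    within = []
    for k in np.unique(idx):
        m = idx == k
        if m.sum() > 200:
            within.append(np.corrcoef(C[m], D[m])[0, 1])
    return marginal, float(np.mean(within))

for p, d in [(0.5, 0.1), (0.3, 0.25), (0.7, 0.05)]:
    marg, cond = corr_CD_given_B(p, d)
    print(f"p={p}, d={d}:  corr(C, D) = {marg:+.3f},  corr(C, D | B) ~ {cond:+.3f}")
```

For every $(p, d)$ setting, the marginal correlation of $C$ and $D$ is close to one while the within-bin (conditional) correlation is close to zero, consistent with the stability claim above.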

Some may argue that $B$ (or equivalently $C$ or $D$) is just a proxy for $A$, and so conditioning on $B$ is really tantamount to conditioning on $A$ (so that, one could argue, we are in fact d-separating $C$ and $D$ because we are really conditioning on $A$). However, such a statement, while apparent in hindsight, may not be known to us when we are attempting causal discovery/inference. More importantly, we can rebut such "proxy" criticisms because it may very well be possible to intervene on $B$ without affecting $A$ (i.e. if $A$ and $B$ pertain to meaningfully/semantically/causally distinct objects/events/variables of our environment, such that one is not a mere measurement proxy of the other). Basically, our rebuttal is that it is possible to find genuine examples of the causal structure we have detailed.

I believe that what we have described should confuse constraint-based causal discovery algorithms… Consider what would happen if we tried to run IC on simulated data from our SCM: we would get a fully disconnected graph, missing all the actual arrows going out of $A$, because every pair of variables can be rendered conditionally independent by conditioning on just one of the remaining variables.
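Here is a rough sketch of what the skeleton phase of IC/PC would do on data simulated from this SCM, assuming a CI test that can actually pick up these (nonlinear) conditional independencies; I use a crude binned-correlation test for that purpose, and the threshold, bin count and parameter values are my own illustrative choices (a purely linear/Fisher-z partial-correlation test would not detect these independencies). Every pair of variables turns out to have a separating set of size one, so every edge gets removed:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p, d = 200_000, 0.5, 0.1

A = (rng.random(n) < p).astype(float)
data = {"A": A,
        "B": A + rng.uniform(-d, d, n),
        "C": A + rng.uniform(-d, d, n),
        "D": A + rng.uniform(-d, d, n)}

def binned_ci(x, y, z, n_bins=200, min_count=200):
    """Crude CI measure: mean correlation of x and y within narrow bins of z."""
    edges = np.linspace(z.min(), z.max() + 1e-9, n_bins + 1)
    idx = np.digitize(z, edges)
    vals = []
    for k in np.unique(idx):
        m = idx == k
        if m.sum() < min_count:
            continue
        if x[m].std() < 1e-12 or y[m].std() < 1e-12:
            vals.append(0.0)  # a conditionally constant variable is independent of anything
        else:
            vals.append(np.corrcoef(x[m], y[m])[0, 1])
    return abs(float(np.mean(vals)))

# Skeleton step of IC/PC with conditioning sets of size <= 1
for X, Y in combinations(data, 2):
    marg = abs(np.corrcoef(data[X], data[Y])[0, 1])
    seps = [Z for Z in data if Z not in (X, Y)
            and binned_ci(data[X], data[Y], data[Z]) < 0.02]
    print(f"{X}-{Y}: |corr| = {marg:.2f}, separating sets of size 1: {seps} "
          f"-> edge {'removed' if seps else 'kept'}")
```

Every pair is strongly dependent marginally, yet every pair is separated by some singleton conditioning set, so the learned skeleton is empty.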

At long last, we get to my questions:

  1. Have I misunderstood something – i.e. is this, in fact, not a stable violation of faithfulness?
  2. If it is, where is the literature discussing it…
  3. Moreover, the type of faithfulness violation I have described here could appear whenever some parent variable has considerable influence on more than one child (or on more than two children if the parent variable itself is latent) and the parent variable itself, or its causal effects, is discrete/non-linear enough. Since this seems like a very "elementary" and plausible scenario for many systems, the resulting violations should be common and should be causing plenty of trouble for causal discovery algorithms. How come many causal discovery algorithms, then, seem not to be attuned to this kind of "scenario"? Is it simply not a problem in practice?

I suppose one could argue that the issue behind all of this really lies with how d-separation is defined… e.g. there needs to be a rule stating that "conditioning on a descendant of the middle/internal node of a fork (or of the internal node of a chain, for that matter) can block the path"…

Finally, my actual question ends here, but for the curious/for the record, I have additionally summarised some background on where/how this issue with faithfulness/d-separation occurred to me:

EXTRA CONTEXT ABOUT THE VIOLATION:

I came up with this violation while pondering causal discovery in causally insufficient settings, such as via the IC* algorithm. The basis of IC*, as detailed here, is that any BN/SCM that is only partially observable (i.e. there are latent variables) has a corresponding projection over the same observable variables that satisfies exactly the same conditional independence (CI) statements w.r.t. the observable variables; hence the original system can be represented, up to CI/Markov equivalence, by a directed/mixed acyclic graph with bidirected edges (a.k.a. its projection). In turn, the IC* algorithm can discover such projections, up to their Markov equivalence classes.

From the get-go, it seemed to me that the IC* algorithm shouldn't work… for it allows us to turn one latent confounder into two or more independent latent variables… To see the problem with doing this, let $A$ from the same SCM described above now become a latent variable. Our system now looks like this:

[Figure: the same DAG, with $A$ now latent and only $B$, $C$, $D$ observed.]

We can show that there is no valid projection (in the sense defined by Verma/Pearl) for the now partially observable system, and so no DAG (with or without bidirected edges) exists that can correctly represent the CI and conditional dependence (CD) statements that will hold for $\{B,C,D\}$. That is, the output of the projection algorithm described by Verma/Pearl gives an "incorrect" structure: consisting of 3 bidirected edges connecting $B$ with $C$, $B$ with $D$, and $C$ with $D$:

[Figure: projection over $\{B,C,D\}$ with bidirected edges $B \leftrightarrow C$, $B \leftrightarrow D$, and $C \leftrightarrow D$.]

That is, I think that, w.r.t. such a structure, there are no possible independent RVs $\{U_5,U_6,U_7\}$, with $B := f_{B}(U_{5},U_{6})$, $C := f_{C}(U_{6},U_{7})$ and $D := f_{D}(U_{5},U_{7})$ (each possibly with its own additional independent noise term), that can yield the required CI conditions, let alone yield observational equivalence.

To see why, recall that the observable RVs $\{B,C,D\}$ are all determined by the value of essentially one latent random variable, $A$ (plus some small additional noise). Yet the projection theorem of Verma implies that we can find three independent exogenous RVs that essentially mimic all the effects of $A$ on the observable set $\{B,C,D\}$ (at least w.r.t. all CI and CD statements), while respecting the fact that each of the three exogenous RVs influences only two out of the three endogenous RVs. I believe that this simply cannot be done…

Here is an informal argument for why observational equivalence cannot be achieved… Recall that observing $X$, where $X$ is any single one of the three observable variables, lets us almost fully determine the values of the other two variables of the observable triple (up to some small noise/error bounded in value by $d$). In turn, it follows that the two exogenous variables responsible for determining $X$ almost fully determine the entire system, so the one exogenous variable not involved in determining $X$ must be redundant. However, as $X$ is arbitrary, it follows that all three of the confounding exogenous variables comprising the hidden variables of the projection are redundant for determining the observables.

So the observables must be chiefly determined by what remains, i.e. by their own independent exogenous noise terms; but these are mutually independent, and each affects only a single observable, so they cannot yield the joint dependence induced by $A$ (which we need to emulate).

CI and CD equivalence also fails to hold: the structure given by the projection (consisting of 3 bidirected edges connecting the 3 observables) implies that conditioning on any single variable yields 2 open paths connecting the other two (the direct bidirected edge, plus the collider path opened by the conditioning), hence they should be dependent, whereas we know they will in fact be rendered independent…
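To make the claimed CI/CD pattern concrete, here is a small simulation sketch over the observables only (Python/NumPy; the parameter values are illustrative choices of mine, and conditioning is again approximated by narrow binning): every pair is strongly dependent marginally, yet every pair becomes independent once we condition on the remaining variable, which is the opposite of what the three-bidirected-edge projection predicts.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200_000, 0.1

A = (rng.random(n) < 0.5).astype(float)          # A is latent: we only keep B, C, D
obs = {"B": A + rng.uniform(-d, d, n),
       "C": A + rng.uniform(-d, d, n),
       "D": A + rng.uniform(-d, d, n)}

def corr_given(x, y, z, n_bins=200):
    """Average corr(x, y) within narrow bins of z, as a stand-in for corr(x, y | z)."""
    edges = np.linspace(z.min(), z.max() + 1e-9, n_bins + 1)
    idx = np.digitize(z, edges)
    vals = []
    for k in np.unique(idx):
        m = idx == k
        if m.sum() > 200:
            vals.append(np.corrcoef(x[m], y[m])[0, 1])
    return float(np.mean(vals))

for x, y, z in [("B", "C", "D"), ("B", "D", "C"), ("C", "D", "B")]:
    marg = np.corrcoef(obs[x], obs[y])[0, 1]
    cond = corr_given(obs[x], obs[y], obs[z])
    print(f"corr({x},{y}) = {marg:+.2f}   corr({x},{y} | {z}) ~ {cond:+.3f}")
```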

Although, to be fair, the original structure (with $A$ observable) also fails to accurately capture the CI structure of the actual model… so really the issue here lies not with IC* in particular, but applies to IC, PC, etc. (all constraint-based causal discovery methods), as mentioned above…

Best Answer

[Figure: excerpt from the question, with the claim that observing $B$ lets us infer the value of $A$ highlighted in red.]

Conditioning on the child of a confounder does not (in general) block the path.

The part in red above rests on the assumption that observing $B$ lets us infer the value of $A$. This is not the case if $A$ is not unique given $B$, due to the exogenous noise on $B$. So for a given value of $B$, there may still be variation in $A$, which results in dependence between $C$ and $D$. That is why controlling for $B$ does not block the path between $C$ and $D$ and, more generally, why controlling for descendants of confounders is not covered by the necessary and sufficient criterion for d-separation as stated in Definition 1.2.3 of "Causality", 2nd edition.
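A small simulation sketch of this contrast (Python/NumPy; the parameters are my own illustrative choices, with the bounded uniform noise on $B$ swapped for Gaussian noise so that $A$ is no longer recoverable from $B$):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

def corr_CD_given_B(noise_on_B, d=0.1, n_bins=200):
    """Fork A -> {B, C, D}; average within-B-bin correlation of C and D."""
    A = (rng.random(n) < 0.5).astype(float)
    B = A + noise_on_B
    C = A + rng.uniform(-d, d, n)
    D = A + rng.uniform(-d, d, n)
    edges = np.linspace(B.min(), B.max() + 1e-9, n_bins + 1)
    idx = np.digitize(B, edges)
    vals = []
    for k in np.unique(idx):
        m = idx == k
        if m.sum() > 200:
            vals.append(np.corrcoef(C[m], D[m])[0, 1])
    return float(np.mean(vals))

# Questioner's case: bounded noise (d < 0.5), so A is uniquely determined by B.
print("uniform noise on B (d=0.1):   corr(C, D | B) ~",
      round(corr_CD_given_B(rng.uniform(-0.1, 0.1, n)), 3))
# General case: Gaussian noise, so A is NOT determined by B and dependence remains.
print("gaussian noise on B (sd=0.5): corr(C, D | B) ~",
      round(corr_CD_given_B(rng.normal(0.0, 0.5, n)), 3))
```

In the bounded-noise case the conditional correlation is essentially zero, while with Gaussian noise on $B$ a substantial conditional dependence between $C$ and $D$ remains, as described above.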

Note that this is in contrast to the case of colliders, where conditioning on the child of a collider unblocks the path (see e.g. page 17 in "Causality", 2nd edition).

Quasi-proxy variables. After some deliberation, I agree that the example you have constructed of a quasi-proxy variable that results in a faithfulness violation is different, and it's not obvious to me if this case is addressed somewhere in the literature. I can imagine that some other model assumption may prevent the case of a variable that can be perfectly predicted from another (in this case its child), but I am not sure. Yours is a very interesting question that may not have been investigated in detail in the literature so far.

Edit: I've updated the answer in response to the comments.
