Solved – Does Simpson’s Paradox cover all instances of reversal from a hidden variable?

causality, contingency-tables, data-visualization, mathematical-statistics, simpsons-paradox

The following is a question about the many visualizations offered as 'proof by picture' of the existence of Simpson's paradox, and possibly a question about terminology.

Simpson's Paradox is a fairly simple phenomenon to describe and to give numerical examples of (the reason why this can happen is deep and interesting). The paradox is that there exist 2x2x2 contingency tables (Agresti, Categorical Data Analysis) where the marginal association has a different direction from each conditional association.

That is, comparison of ratios in two subpopulations can both go in one direction but the comparison in the combined population goes in the other direction. In symbols:

There exist $a,b,c,d,e,f,g,h$ such that
$$
\frac{a+b}{c+d} > \frac{e+f}{g+h}
$$

but
$$
\frac{a}{c} < \frac{e}{g}
$$
and

$$
\frac{b}{d} < \frac{f}{h}
$$
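For concreteness, here is a minimal numeric check of the three inequalities; the counts are invented for illustration and are not from the original Simpson/Yule examples:

```python
# Numeric check of the three inequalities defining Simpson's paradox.
# The specific counts below are illustrative, chosen only to satisfy them.
a, b, c, d = 1, 6, 5, 8
e, f, g, h = 2, 4, 8, 5

marginal = ((a + b) / (c + d), (e + f) / (g + h))
sub1 = (a / c, e / g)
sub2 = (b / d, f / h)

assert marginal[0] > marginal[1]   # combined: 7/13 > 6/13
assert sub1[0] < sub1[1]           # first subpopulation: 1/5 < 2/8
assert sub2[0] < sub2[1]           # second subpopulation: 6/8 < 4/5
print(marginal, sub1, sub2)
```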

This is accurately represented in the following visualization (from Wikipedia):

[Figure: Simpson's paradox illustrated with vectors]

Each fraction is simply the slope of the corresponding vector, and it is easy to see in the example that the shorter B vectors each have a larger slope than the corresponding L vectors, yet the combined B vector has a smaller slope than the combined L vector.

There is a very common visualization in many forms, one in particular at the front of that wikipedia reference on Simpson's:

[Figure: regression lines with contrary slopes in the subpopulations]

This is a great example of confounding: it shows how a hidden variable (one that separates two subpopulations) can reveal a pattern opposite to that of the combined population.

However, mathematically, such an image in no way corresponds to a display of the contingency tables that are at the basis of the phenomenon known as Simpson's paradox. First, the regression lines are over real-valued point set data, not count data from a contingency table.

Also, one can create point-set data with an arbitrary relation between the slopes of the regression lines, whereas in contingency tables there is a restriction on how far apart the ratios can be. That is, a population's regression line can be orthogonal to every regression line of its subpopulations. But in Simpson's paradox the subpopulation ratios, though they are not regression slopes, cannot stray too far from the ratio of the amalgamated population, even when they point in the other direction (again, see the vector comparison image from Wikipedia).

For me, that is enough to be taken aback every time I see the latter image presented as a visualization of Simpson's paradox. But since I see these (in my view, wrong) examples everywhere, I'm curious to know:

  • Am I missing a subtle transformation from the original Simpson/Yule examples of contingency tables into real values that justifies the regression-line visualization?
  • Surely Simpson's is a particular instance of confounding error. Has the term 'Simpson's Paradox' now become equated with confounding error, so that whatever the math, any change in direction via a hidden variable can be called Simpson's Paradox?

Addendum: Here is an example of a generalization to a 2xmxn (or 2 by m by continuous) table:
[Figure: percentage of baskets made vs. defender distance; hidden variable is shot type]

If amalgamated over shot type, it looks like a player makes more shots when defenders are closer. Grouped by shot type (distance from basket really), the more intuitively expected situation occurs, that more shots are made the further away defenders are.

This image is what I consider to be a generalization of Simpson's to a more continuous situation (distance of defenders). But I still don't see yet how the regression line example is an example of Simpson's.

Best Answer

The paradox is that there exist 2x2x2 contingency tables (Agresti, Categorical Data Analysis) where the marginal association has a different direction from each conditional association [...] Am I missing a subtle transformation from the original Simpson/Yule examples of contingency tables into real values that justify the regression line visualization?

The main issue is that you are equating one simple way to show the paradox with the paradox itself. The simple example of the contingency table is not the paradox per se. Simpson's paradox is about conflicting causal intuitions when comparing marginal and conditional associations, most often due to sign reversals (or extreme attenuations such as independence, as in the original example given by Simpson himself, in which there isn't a sign reversal). The paradox arises when you interpret both estimates causally, which could lead to different conclusions --- does the treatment help or hurt the patient? And which estimate should you use?

Whether the paradoxical pattern shows up on a contingency table or in a regression, it doesn't matter. All variables can be continuous and the paradox could still happen --- for instance, you could have a case where $\frac{\partial E(Y|X)}{\partial X} > 0$ yet $\frac{\partial E(Y|X, C = c)}{\partial X} < 0, \forall c$.
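As a small sketch of the all-continuous case (the data-generating process below is my own illustrative choice, not from the answer), we can simulate data where the overall slope of $Y$ on $X$ is positive while the slope within every level of $C$ is negative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000
c = rng.integers(0, 3, n)                 # observed grouping variable C
x = c + 0.3 * rng.standard_normal(n)      # X shifts up with C
y = -x + 3 * c + 0.3 * rng.standard_normal(n)  # within each C, Y falls in X

def slope(x, y):
    """OLS slope of y on x: cov(x, y) / var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

overall = slope(x, y)                                    # positive
within = [slope(x[c == k], y[c == k]) for k in range(3)] # all negative
print(overall, within)
```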

Surely Simpson's is a particular instance of confounding error.

This is incorrect! Simpson's paradox is not a particular instance of confounding error -- if it were just that, then there would be no paradox at all. After all, if you are sure some relationship is confounded you would not be surprised to see sign reversals or attenuations in contingency tables or regression coefficients --- maybe you would even expect that.

So while Simpson's paradox refers to a reversal (or extreme attenuation) of "effects" when comparing marginal and conditional associations, this might not be due to confounding and a priori you can't know whether the marginal or the conditional table is the "correct" one to consult to answer your causal query. In order to do that, you need to know more about the causal structure of the problem.

Consider these examples given in Pearl:

[Figure: four causal diagrams, (a)–(d), relating $X$, $Y$, and $Z$]

Imagine that you are interested in the total causal effect of $X$ on $Y$. The reversal of associations could happen in all of these graphs. In (a) and (d) we have confounding, and you would adjust for $Z$. In (b) there's no confounding, $Z$ is a mediator, and you should not adjust for $Z$. In (c) $Z$ is a collider and there's no confounding, so you should not adjust for $Z$ either. That is, in two of these examples (b and c) you could observe Simpson's paradox, yet, there's no confounding whatsoever and the correct answer for your causal query would be given by the unadjusted estimate.
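A quick simulation can illustrate the collider case (c); the structural equations here are invented for illustration. The true causal effect of $X$ on $Y$ is $+1$ and is recovered by the unadjusted regression, while "adjusting" for the collider $Z$ flips the sign of the coefficient even though there is no confounding:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x = rng.standard_normal(n)
y = x + rng.standard_normal(n)           # true causal effect of X on Y is +1
z = x + 2 * y + rng.standard_normal(n)   # Z is a collider: X -> Z <- Y

marginal = np.polyfit(x, y, 1)[0]        # ~ +1, the correct causal answer

# coefficient of X after (wrongly) adjusting for the collider Z
X = np.column_stack([x, z, np.ones(n)])
adjusted = np.linalg.lstsq(X, y, rcond=None)[0][0]  # negative: sign reversal
print(marginal, adjusted)
```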

Pearl's explanation of why this was deemed a "paradox" and why it still puzzles people is very plausible. Take the simple case depicted in (a) for instance: causal effects can't simply reverse like that. Hence, if we are mistakenly assuming both estimates are causal (the marginal and the conditional), we would be surprised to see such a thing happening --- and humans seem to be wired to see causation in most associations.

So back to your main (title) question:

Does Simpson's Paradox cover all instances of reversal from a hidden variable?

In a sense, this is the current definition of Simpson's paradox. But obviously the conditioning variable is not hidden, it has to be observed otherwise you would not see the paradox happening. Most of the puzzling part of the paradox stems from causal considerations and this "hidden" variable is not necessarily a confounder.

Contingency tables and regression

As discussed in the comments, the algebraic identity between running a regression on binary data and computing differences of proportions from the contingency tables may help in understanding why the paradox showing up in regressions is of the same nature. Imagine your outcome is $y$, your treatment $x$, and your grouping variable $z$, all binary.

Then the overall difference in proportion is simply the regression coefficient of $y$ on $x$. Using your notation:

$$ \frac{a+b}{c+d} - \frac{e+f}{g+h} = \frac{cov(y,x)}{var(x)} $$

And the same thing holds for each subgroup of $z$ if you run separate regressions, one for $z=1$:

$$ \frac{a}{c} - \frac{e}{g} = \frac{cov(y,x|z =1)}{var(x|z=1)} $$

And another for $z =0$:

$$ \frac{b}{d} - \frac{f}{h} = \frac{cov(y,x|z=0)}{var(x|z=0)} $$

Hence, in terms of regression, the paradox corresponds to the coefficient for the whole population $\left(\frac{cov(y,x)}{var(x)}\right)$ having one sign while both subgroup coefficients $\left(\frac{cov(y,x|z)}{var(x|z)}\right)$ have the opposite sign.
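These identities are easy to verify numerically. The sketch below builds a binary data set from invented counts (using the post's letters, with $a$ successes out of $c$ treated units in subgroup $z=1$, and so on) and checks that each difference in proportions equals the corresponding $\frac{cov(y,x)}{var(x)}$:

```python
import numpy as np

# Illustrative counts: (successes, total) per (x, z) cell.
a, c, e, g = 1, 5, 2, 8    # z = 1: treated (a of c), untreated (e of g)
b, d, f, h = 6, 8, 4, 5    # z = 0: treated (b of d), untreated (f of h)

x, y, z = [], [], []
for succ, tot, xi, zi in [(a, c, 1, 1), (e, g, 0, 1),
                          (b, d, 1, 0), (f, h, 0, 0)]:
    x += [xi] * tot
    z += [zi] * tot
    y += [1] * succ + [0] * (tot - succ)
x, y, z = map(np.array, (x, y, z))

def beta(x, y):
    """OLS regression coefficient of y on x (population moments)."""
    return np.cov(x, y, ddof=0)[0, 1] / np.var(x)

# overall difference in proportions equals the regression coefficient
assert np.isclose((a + b) / (c + d) - (e + f) / (g + h), beta(x, y))
# and likewise within each subgroup of z
assert np.isclose(a / c - e / g, beta(x[z == 1], y[z == 1]))
assert np.isclose(b / d - f / h, beta(x[z == 0], y[z == 0]))
```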
