Simpson’s Paradox Explained with Equations – How to Understand Rather Than Using Contingency Tables

mathematical-statisticssimpsons-paradox

I probably don't have a clear understanding of the Simpson's paradox. Informally I know that the average of response Y1, grouped over all possible levels of factor A, can be higher than the average of response Y2 over all levels of A, even if the average of Y1 for each level of A (each group) is always less than the corresponding average of Y2. I've read examples, but I still get surprised each time I see it, maybe because I don't learn well by specific examples: I have troubles generalizing them. I learn best, and would rather see, an explanation in formulas. Can you please explain the paradox relying on equations, instead than count tables?

Also, I think the reason for my surprise is that I could unconsciously be making some assumptions about the averages involved in the paradox, which may not be true in general. Maybe I forget to weight by the number of samples in each group? But then, I would like to see an equation which shows me that the estimate of the total average is more accurate if I weight each group average by the number of samples in each group, because (if this is true) it isn't obvious to me in general. Naively I would think that the estimate of $\mathbf{E}[Y_1]$ has a lower standard error when I have more samples, irrespective of weighting.

Best Answer

Here is a general approach to understanding Simpson's Paradox algebraically for count data.

Suppose that we have survival data for an exposure and we create a 2x2 contingency table. To keep things simple we will have the same counts in each cell. We could relax this, but it would make the algebra quite messy.

\begin{array}{|c|c|c|c|} \hline & \text{Died} & \text{Survived} & \text{Death Rate} \\ \hline \text{Exposed} & X & X & 0.5 \\ \hline \text{Unexposed}& X & X & 0.5\\ \hline \end{array}

In this case, the Death Rate is the same in both the Exposed and Unexposed groups.

Now, if we split the data, say into one group for females and another group for males, we obtain 2 tables, with the following counts:

Males: \begin{array}{|c|c|c|c|} \hline & \text{Died} & \text{Survived} & \text{Death Rate} \\ \hline \text{Exposed} & Xa & Xb & \frac{a}{a+b} \\ \hline \text{Unexposed}& Xc & Xd & \frac{c}{c+d}\\ \hline \end{array}

and for females: \begin{array}{|c|c|c|c|} \hline & \text{Died} & \text{Survived} & \text{Death Rate} \\ \hline \text{Exposed} & X(a-1) & X(b-1) & \frac{a-1}{a+b-2} \\ \hline \text{Unexposed}& X(c-1) & X(d-1) & \frac{c-1}{c+d-2}\\ \hline \end{array}

where $a,b,c,d \in [0,1]$ are the proportions of each cell in the aggregated data table that are male.

Simpson's Paradox will occur when the death rates for exposed males is greater than the death rate for unexposed males AND the death rate for exposed females is greater than the death rate for unexposed females. Alternatively, it will also occur when the death rates for exposed males is less than the death rate for unexposed males AND the death rate for exposed females is less than the death rate for unexposed females. That is, when

$$\left(\frac{a}{a+b} < \frac{c}{c+d}\right) \text{ and } \left(\frac{a-1}{a+b-2} < \frac{c-1}{c+d-2}\right)$$

$$ \text{Or }$$

$$\left(\frac{a}{a+b} > \frac{c}{c+d}\right) \text{ and } \left(\frac{a-1}{a+b-2} > \frac{c-1}{c+d-2}\right)$$

As a concrete example, let $X=100$, and $a=0.5, b=0.8, c=0.9$. Then we will have Simpson's paradox when:

$$\left(\frac{0.5}{0.8+0.9} < \frac{0.9}{0.9+d}\right) \text{ and } \left(\frac{0.5-1}{0.5+0.8-2} < \frac{0.9-1}{0.9+d-2}\right)$$

$$ (-9 < d < 1.44) \text{ and } (0.96 < d < 1.1) $$

From which we conclude that d must lie in $(0.96,1]$

The 2nd set of inequalities gives:

$$\left(\frac{0.5}{0.8+0.9} > \frac{0.9}{0.9+d}\right) \text{ and } \left(\frac{0.5-1}{0.5+0.8-2} > \frac{0.9-1}{0.9+d-2}\right)$$

$$ (d < -0.9 \text{ or } d>1.44) \text{ and } (0.96 < d \text{ or } d > 1.44) $$

which has no solution for $d \in [0,1]$

So for the three values that we chose for $a,b,$ and $c$, to invoke Simpson's paradox, $d$ must be greater than 0.96. In the case where the value was $0.99$ then we would obtain a Death Rate for Males of

$$ 0.5/ (0.5+0.8) = 38 \text{% in the exposed group} $$ $$ 0.9/ (0.9+0.99) = 48 \text{% in the unexposed group} $$

and for Females:

$$ (0.5-1)/ (0.5+0.8-2) = 71 \text{% in the exposed group} $$ $$ (0.9-1)/ (0.9+0.99-2) = 91 \text{% in the unexposed group} $$

So, males have a higher death rate in the unexposed group than in the exposed group, and females also have a higher death rate in the unexposed group than the exposed group, yet the death rates in the aggregated data are the same for exposed and unexposed.