Simpson’s Paradox

Tags: correlation, data visualization, hypothesis testing, inference, simpsons-paradox

I need some help figuring out whether the phenomenon of Simpson's paradox has occurred.

Here is a plot of the first dataset (correlation between probability of contracting a disease vs. hours slept for the first and second plots is -0.8 and -0.2, respectively).

[Image: the first and second plots of the first dataset]

AND

Here is the plot it needs to be compared with (correlation between probability of contracting a disease vs. hours slept = 0.3).

[Image: plot of the second dataset]

Both data sets have the probability of contracting a disease on the $y$ axis and hours slept on the $x$ axis.

Best Answer

In order to understand this, look carefully at the combined plot:

[Image: the combined plot]

Even though each group shows a downward-sloping association between the variables, the dataset as a whole shows a positive association: a line of best fit through all the points would slope upwards. To really see this, draw each plot with its own line of best fit. The individual plots will each have a downward-sloping line, whereas the combined plot will have an upward-sloping one. A reversal of this kind between the within-group and the combined association is what is meant by Simpson's paradox.
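One way to see why such a reversal is possible is the law of total covariance. This is a sketch added for illustration; the symbol $G$ for the group label is an assumption introduced here and does not appear in the question.

$$
\operatorname{Cov}(X, Y) \;=\; \underbrace{\mathbb{E}\big[\operatorname{Cov}(X, Y \mid G)\big]}_{\text{within groups}} \;+\; \underbrace{\operatorname{Cov}\big(\mathbb{E}[X \mid G],\, \mathbb{E}[Y \mid G]\big)}_{\text{between groups}}
$$

The first term is negative when every group has a downward-sloping association. If the group with more hours slept also has the higher average probability of disease, the second term is positive, and when it is large enough it outweighs the first, so the overall covariance (and hence the correlation) comes out positive.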

Edit: To show this with a simulation

We simulate two groups of data with characteristics similar to those in the OP:

set.seed(1)
N <- 30
X1 <- runif(N, 2, 8)
X2 <- runif(N, -1, 6)

Y1 <- 60 - 2 * X1 + rnorm(N, 0, 2)
Y2 <- 40 - X2 + rnorm(N, 0, 2)

cor(X1, Y1); cor(X2, Y2)
[1] -0.9026919
[1] -0.7543316

So we have correlations of -0.9 and -0.75 respectively. Now let's plot them:

plot(X1, Y1, col = 'red', xlim = c(-1, 8), ylim = c(min(Y1, Y2), max(Y1, Y2)))
abline(lm(Y1 ~ X1), col = 'red')

points(X2, Y2, col = 'blue')
abline(lm(Y2 ~ X2), col = 'blue')

and finally we can add the line of best fit for the combined data:

abline(lm(c(Y1, Y2) ~ c(X1, X2)), col = 'black')

[Image: both simulated groups with their individual lines of best fit and the line of best fit for the combined data]

So we can see the downward sloping lines of the individual groups, and the upward sloping line of the combined data. And we can verify the correlation in the combined data:

cor(c(Y1, Y2) , c(X1, X2))
[1] 0.2066156
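
The same check can be written as a small script. What follows is a minimal sketch rather than a definitive test: it re-creates the simulated data above, compares the sign of each within-group correlation with the sign of the pooled correlation, and then fits a regression that adjusts for group membership. The names `dat`, `group`, `within_cor`, `pooled_cor`, `reversal`, `pooled_slope` and `adjusted_slope` are chosen here for illustration, and treating "all groups share one sign, pooled data has the opposite sign" as the criterion is an assumption about what counts as the paradox.

```r
# Re-create the simulated data from above (same seed and same call order).
set.seed(1)
N <- 30
X1 <- runif(N, 2, 8)
X2 <- runif(N, -1, 6)
Y1 <- 60 - 2 * X1 + rnorm(N, 0, 2)
Y2 <- 40 - X2 + rnorm(N, 0, 2)

# Stack both groups into one data frame with a group label
# (the names `dat` and `group` are illustrative).
dat <- data.frame(
  x = c(X1, X2),
  y = c(Y1, Y2),
  group = factor(rep(c("g1", "g2"), each = N))
)

# Correlation within each group, and correlation in the pooled data.
within_cor <- sapply(split(dat, dat$group), function(d) cor(d$x, d$y))
pooled_cor <- cor(dat$x, dat$y)

# Sign reversal: every group shares one sign and the pooled data has the other.
reversal <- length(unique(sign(within_cor))) == 1 &&
  sign(pooled_cor) == -sign(within_cor[[1]])

# Slope of x when the groups are ignored, and after adjusting for group.
pooled_slope <- coef(lm(y ~ x, data = dat))[["x"]]
adjusted_slope <- coef(lm(y ~ x + group, data = dat))[["x"]]

print(within_cor)
print(pooled_cor)
print(reversal)
print(c(pooled = pooled_slope, adjusted = adjusted_slope))
```

For real data, replace the simulated vectors with the observed hours slept, disease probability and group label. If `reversal` is `TRUE` and the slope changes sign once `group` enters the model, the pooled association is being driven by differences between the groups rather than by the relationship within them.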