Simpson’s Paradox

correlation, data visualization, hypothesis testing, inference, simpsons-paradox

I need some help figuring out whether the phenomenon of Simpson's paradox has occurred.

Here are the two plots of the first dataset (the correlation between probability of contracting a disease and hours slept is -0.8 in the first plot and -0.2 in the second).

AND

Here is the plot it needs to be compared with (correlation between probability of contracting a disease vs. hours slept = 0.3).

Both data sets have the probability of contracting a disease on the $y$ axis and hours slept on the $x$ axis.

Best Answer

In order to understand this, look carefully at the combined plot:

Each group has a downward-sloping association between the variables, but when we look at the whole dataset there is a positive association: a line of best fit through all the points would slope upwards. To really see this, draw each plot with its line of best fit. The individual plots will have a downward-sloping line, whereas the combined one will be upward-sloping.
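One way to see why the sign can flip is the law of total covariance. The sketch below writes $X$ for hours slept, $Y$ for the probability of contracting the disease, and $G$ for the group label ($G$ is a symbol introduced here for illustration and does not appear in the question):

$$\operatorname{Cov}(X, Y) = \mathbb{E}\big[\operatorname{Cov}(X, Y \mid G)\big] + \operatorname{Cov}\big(\mathbb{E}[X \mid G],\ \mathbb{E}[Y \mid G]\big)$$

The first term averages the within-group covariances, which are negative when every group slopes downward. The second term is the covariance of the group means, which is positive when the group that sleeps more on average also has the higher average disease probability. If the second term outweighs the first, the pooled association comes out positive even though each group on its own shows a negative one.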

Edit: To show this with a simulation

We simulate two groups of data with characteristics similar to those in the OP:

set.seed(1)
N <- 30
X1 <- runif(N, 2, 8)
X2 <- runif(N, -1, 6)

Y1 <- 60 - 2 * X1 + rnorm(N, 0, 2)
Y2 <- 40 - X2 + rnorm(N, 0, 2)

cor(X1, Y1); cor(X2, Y2)
[1] -0.9026919
[1] -0.7543316

So we have correlations of -0.9 and -0.75 respectively. Now let's plot them:

plot(X1, Y1, col = 'red', xlim = c(-1, 8), ylim = c(min(Y1, Y2), max(Y1, Y2)))
abline(lm(Y1 ~ X1), col = 'red')

points(X2, Y2, col = 'blue')
abline(lm(Y2 ~ X2), col = 'blue')

Finally, we can add the line of best fit for the combined data:

abline(lm(c(Y1, Y2) ~ c(X1, X2)), col = 'black')

So we can see the downward sloping lines of the individual groups, and the upward sloping line of the combined data. And we can verify the correlation in the combined data:

cor(c(Y1, Y2), c(X1, X2))
[1] 0.2066156
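The same reversal can also be checked numerically rather than by eye. The following is a minimal sketch, not part of the original answer: it re-creates the simulated data with the same seed and call order, then compares the pooled slope with the within-group slopes and with the slope from a model that adjusts for group. The names `dat`, `group`, `pooled_slope`, `within_slopes`, `adjusted_slope` and `sign_reversal` are introduced here for illustration only.

```r
# Re-create the simulated data from the answer (same seed, same call order).
set.seed(1)
N <- 30
X1 <- runif(N, 2, 8)
X2 <- runif(N, -1, 6)
Y1 <- 60 - 2 * X1 + rnorm(N, 0, 2)
Y2 <- 40 - X2 + rnorm(N, 0, 2)

# Stack both groups into one data frame; `group` is a label added for this sketch.
dat <- data.frame(
  x = c(X1, X2),
  y = c(Y1, Y2),
  group = factor(rep(c("g1", "g2"), each = N))
)

# Slope of y on x when the grouping is ignored.
pooled_slope <- unname(coef(lm(y ~ x, data = dat))["x"])

# Slope of y on x within each group, fitted separately.
within_slopes <- sapply(split(dat, dat$group), function(d) {
  unname(coef(lm(y ~ x, data = d))["x"])
})

# Common slope after adjusting for group (parallel-lines model).
adjusted_slope <- unname(coef(lm(y ~ x + group, data = dat))["x"])

# TRUE when every within-group slope has the opposite sign to the pooled slope.
sign_reversal <- all(sign(within_slopes) == -sign(pooled_slope))

print(pooled_slope)
print(within_slopes)
print(adjusted_slope)
print(sign_reversal)
```

If `sign_reversal` is `TRUE`, the pooled trend points the opposite way to every within-group trend, which is the pattern described above. The adjusted model assumes a single common slope across groups, which is a simplification here because the two groups were simulated with different slopes.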
