Simpson’s Paradox – Identifying Simpson’s Paradox in the Titanic Dataset

simpsons-paradox

With the well-known "Survival of passengers on the Titanic" data set, I get strange behaviour when plotting the fare against the age (see below). Without conditioning on Pclass, the correlation is positive. In contrast, within each Pclass the correlation seems to be negative.

I assume that's a form of "Simpson's Paradox", but I am not sure. How can this behaviour best be explained in this particular case?

# df is a pandas dataframe with the titanic data set
# see https://www.kaggle.com/c/titanic

import seaborn as sns
# recent seaborn versions require keyword arguments for data, x and y
sns.jointplot(data=df, x="Age", y="Fare", kind="reg")

Fare vs. Age for all Passenger Classes

# one regression panel per passenger class
sns.lmplot(data=df, x="Age", y="Fare", col="Pclass")

Fare vs. Age, one panel per Passenger Class

Best Answer

Although Simpson's paradox (or Simpson's reversal) is more often discussed for 3-way contingency tables than for correlations between continuous variables, it's the same phenomenon.
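One way to see why the reversal can also occur with continuous variables is the law of total covariance. The symbols here are my own labels, not from the question: A for Age, F for Fare, C for Pclass.

```latex
\operatorname{Cov}(A, F)
  = \mathbb{E}\bigl[\operatorname{Cov}(A, F \mid C)\bigr]
  + \operatorname{Cov}\bigl(\mathbb{E}[A \mid C],\ \mathbb{E}[F \mid C]\bigr)
```

The first term averages the within-class covariances, and the second is the covariance between the class means. The pooled covariance can be positive even when every within-class covariance is negative, provided the between-class term is positive and large enough.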

Here, the explanation in simple words seems clear: although within each class there is a slight tendency for fares to decrease with age, people in lower classes tend to be younger. That is, younger people tend to travel in lower classes, where fares are lower, and therefore younger people tend to pay lower fares overall.
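The reversal can be checked numerically by comparing the pooled correlation with the correlation inside each class. This is a minimal sketch: the helper name `pooled_and_within_corr` is hypothetical, and the `toy` frame contains invented numbers chosen only to mimic the pattern (it is not Titanic data).

```python
import pandas as pd


def pooled_and_within_corr(df, x="Age", y="Fare", group="Pclass"):
    """Return (pooled Pearson r, Series of Pearson r within each group)."""
    # Drop rows with missing values in any of the three columns
    # (Age has missing entries in the Kaggle file).
    data = df[[x, y, group]].dropna()
    pooled = data[x].corr(data[y])
    within = pd.Series(
        {key: part[x].corr(part[y]) for key, part in data.groupby(group)},
        name="within_r",
    )
    return pooled, within


# Invented toy numbers, NOT Titanic data: fares fall with age inside each
# class, but the higher classes are both older and more expensive.
toy = pd.DataFrame({
    "Age":    [5, 15, 25, 20, 30, 40, 35, 45, 55],
    "Fare":   [9, 8, 7, 22, 21, 20, 90, 85, 80],
    "Pclass": [3, 3, 3, 2, 2, 2, 1, 1, 1],
})

pooled, within = pooled_and_within_corr(toy)
print("pooled r:", pooled)   # positive for the toy data
print(within)                # negative in every class for the toy data
```

With the Titanic frame from the question, the same call would be `pooled_and_within_corr(df)`, assuming the columns are named `Age`, `Fare` and `Pclass` as in the plots.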

As for people being younger in the lower classes, you can see in the plot that there are a lot of children (age < 18) in 3rd class, fewer of them in 2nd class (clearly fewer people aged 0-20 than 20-40), and very few children in 1st class. Comparing the 40-60 and 60-80 bands with the 20-40 band also shows that people tend to be younger in lower classes.
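Rather than reading the age distribution off the plot, it can be summarised per class. The sketch below is one way to do that; the helper name `age_profile_by_class` and the `child_cutoff` parameter are hypothetical, and the cutoff of 18 simply follows the definition of "children" used above.

```python
import pandas as pd


def age_profile_by_class(df, age="Age", group="Pclass", child_cutoff=18):
    """Median age and share of children (age < child_cutoff) per group."""
    # Ignore passengers whose age is missing.
    data = df[[age, group]].dropna()
    data = data.assign(is_child=data[age] < child_cutoff)
    return data.groupby(group).agg(
        median_age=(age, "median"),
        share_children=("is_child", "mean"),
    )
```

Calling `age_profile_by_class(df)` on the Titanic frame would return one row per `Pclass`. If the reading of the plot is right, the median age should drop and the share of children should rise from 1st to 3rd class.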

In summary: yes, it is an occurrence of Simpson's paradox. Younger people tend to travel in lower classes and therefore tend to pay lower fares, even if they tend to pay a little more than older people in the same class.

And just as a side comment: this is not the only occurrence of Simpson's paradox in the Titanic dataset. See https://select-statistics.co.uk/blog/hidden-data-and-surviving-a-sinking-ship-simpsons-paradox/ or https://www2.stat.duke.edu/courses/Fall12/sta611/SimpsonsParadox.pdf for another one.