Machine Learning – How to Make a Machine Learning Model Robust to Simpson’s Paradox

categorical-data, machine-learning, neural-networks, simpsons-paradox, unbalanced-classes

Suppose a model that predicts which location/landmark a walking tourist will visit next, based on two geographical input features:

  • the last neighborhood this person has walked through
  • the second-to-last neighborhood this person has walked through (so, before the last one)

The distribution of the training data looks as follows:

| before-last-neighborhood | last-neighborhood | next-visited-place     | count  |
|--------------------------|-------------------|------------------------|--------|
| central park             | times square      | south of times square  | 10,000 |
| central park             | times square      | north of times square  | 16     |
| wall street              | times square      | north of times square  | 90     |
| wall street              | times square      | south of times square  | 3      |

In this case, because of the imbalance in the data, the issue is that the model assigns very strong predictive power to times-square => somewhere south, regardless of the neighborhood crossed before that.

However, common sense says that someone going through Times Square from Wall Street is probably heading north, not south. The training data actually reflects this; it just happens that many more people crossed Times Square coming from the north than from the south.
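Working out the conditional probabilities from the table above makes the paradox explicit: ignoring the previous neighborhood, roughly 99% of Times Square crossings end up going south, yet conditioned on coming from Wall Street, roughly 97% head north. A quick Python check (counts taken from the table):

```python
# Counts from the table above: (before-last, last, next) -> count
counts = {
    ("central park", "times square", "south"): 10_000,
    ("central park", "times square", "north"): 16,
    ("wall street",  "times square", "north"): 90,
    ("wall street",  "times square", "south"): 3,
}

total_ts = sum(counts.values())                 # everyone crossing Times Square
p_south_given_ts = (10_000 + 3) / total_ts      # P(south | times square)      ~ 0.989
p_north_given_ws_ts = 90 / (90 + 3)             # P(north | wall st, times sq) ~ 0.968

print(f"P(south | times square)              = {p_south_given_ts:.3f}")
print(f"P(north | wall street, times square) = {p_north_given_ws_ts:.3f}")
```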

What would be some effective ways to make a model more robust to this phenomenon, so that it actually learns that e.g. wall street + times-square => going north?

I've tried different forms of feature engineering: adding more features (e.g. nationality of the tourist, gender, and other similar attributes), and combining (before-last-zone + last-zone) into a single categorical feature, but all of this only helps marginally. The reality is still that most people crossing Times Square are doing so southbound, and the model insists on predicting south for anyone crossing Times Square, regardless of their provenance.

In a neural network context, would there be, for instance, a way to somehow assign "more weight" to a particular combination of features, as a way to tell the model that these are the most important features to look at?
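The closest mechanism I can think of is per-example weighting rather than per-feature weighting, i.e. up-weighting the rare rows so the loss pays more attention to the (wall street, times square) combination. A rough PyTorch-style sketch of what I mean (all shapes, labels, and weights are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Hypothetical batch: logits from the network, integer class targets (0 = south, 1 = north),
# and a per-example weight that is larger for rare feature combinations
# such as (wall street, times square).
logits = torch.randn(4, 2, requires_grad=True)   # 4 examples, 2 classes
targets = torch.tensor([0, 0, 1, 1])             # dummy labels
weights = torch.tensor([1.0, 1.0, 50.0, 50.0])   # up-weight the rare rows

per_example_loss = F.cross_entropy(logits, targets, reduction="none")
loss = (per_example_loss * weights).mean()       # weighted loss used for backprop
loss.backward()
```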

I'd also like to mention that the real-world problem is much more complex than this, and it would be intractable to try to artificially rebalance the data via bagging/bootstrapping. This would be a 50+ dimensional task with literally thousands of edge cases to consider.

Best Answer

Sometimes the simplest solutions are the ones we think of last.

I eventually removed the individual features last-neighborhood and before-last-neighborhood and replaced them with a single combined feature: two-last-neighborhoods (previous and last, in order).

This way, the model no longer puts everything with last-neighborhood: times-square in the same bracket, and those who cross Times Square from Wall Street are now (correctly) predicted to be heading somewhere north.
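Concretely, the change boils down to something like the following (a minimal pandas sketch; column names and the separator are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "before_last_neighborhood": ["central park", "central park", "wall street", "wall street"],
    "last_neighborhood":        ["times square", "times square", "times square", "times square"],
})

# Single combined categorical feature: (previous, last) in order
df["two_last_neighborhoods"] = (
    df["before_last_neighborhood"] + " -> " + df["last_neighborhood"]
)

# One-hot encode the combined feature before feeding it to the model
features = pd.get_dummies(df["two_last_neighborhoods"])
print(features.columns.tolist())
# ['central park -> times square', 'wall street -> times square']
```

Because each (previous, last) pair now gets its own category, the model can learn a separate mapping for wall street -> times square without it being drowned out by the much more frequent central park -> times square rows.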