Machine Learning – How to Make a Machine Learning Model Robust to Simpson’s Paradox

categorical-data, machine-learning, neural-networks, simpsons-paradox, unbalanced-classes

Suppose a model that predicts which location/landmark a walking tourist will visit next, based on two geographical input features:

  • the last neighborhood this person has walked through
  • the second-to-last neighborhood this person has walked through (so, before the last one)

The distribution of the training data looks as follows:

| before-last-neighborhood | last-neighborhood | next-visited-place     | count  |
|--------------------------|-------------------|------------------------|--------|
| central park             | times square      | south of times square  | 10,000 |
| central park             | times square      | north of times square  | 16     |
| wall street              | times square      | north of times square  | 90     |
| wall street              | times square      | south of times square  | 3      |

In this case, because of the imbalance in the data, the issue is that the model assigns very strong predictive power to times-square => somewhere south, regardless of the neighborhood crossed before that.

However, common sense says that someone going through Times Square from Wall Street is probably heading north, not south. The training data actually reflects this; it just happens that many more people crossed Times Square coming from the north than from the south.
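Working out the conditional probabilities from the table above makes the paradox explicit: ignoring the previous neighborhood, roughly 99% of Times Square crossings end up going south, yet conditioned on coming from Wall Street, roughly 97% head north. A quick Python check (counts taken from the table):

```python
# Counts from the table above: (before-last, last, next) -> count
counts = {
    ("central park", "times square", "south"): 10_000,
    ("central park", "times square", "north"): 16,
    ("wall street",  "times square", "north"): 90,
    ("wall street",  "times square", "south"): 3,
}

total_ts = sum(counts.values())                 # everyone crossing Times Square
p_south_given_ts = (10_000 + 3) / total_ts      # P(south | times square)      ~ 0.989
p_north_given_ws_ts = 90 / (90 + 3)             # P(north | wall st, times sq) ~ 0.968

print(f"P(south | times square)              = {p_south_given_ts:.3f}")
print(f"P(north | wall street, times square) = {p_north_given_ws_ts:.3f}")
```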

What would be some effective ways to make a model more robust to this phenomenon, so that it actually learns that e.g. wall street + times-square => going north?

I've tried different forms of feature engineering: adding more features (e.g. nationality of the tourist, gender, and other similar attributes), and combining (before-last-zone + last-zone) into a single categorical feature, but all of this only helps marginally. The reality is still that most people crossing Times Square are doing so southbound, and the model insists on predicting south for anyone crossing Times Square, regardless of their provenance.

In a neural network context, would there be, for instance, a way to somehow assign "more weight" to a particular combination of features, as a way to tell the model that these are the most important features to look at?
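The closest mechanism I can think of is per-example weighting rather than per-feature weighting, i.e. up-weighting the rare rows so the loss pays more attention to the (wall street, times square) combination. A rough PyTorch-style sketch of what I mean (all shapes, labels, and weights are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Hypothetical batch: logits from the network, integer class targets (0 = south, 1 = north),
# and a per-example weight that is larger for rare feature combinations
# such as (wall street, times square).
logits = torch.randn(4, 2, requires_grad=True)   # 4 examples, 2 classes
targets = torch.tensor([0, 0, 1, 1])             # dummy labels
weights = torch.tensor([1.0, 1.0, 50.0, 50.0])   # up-weight the rare rows

per_example_loss = F.cross_entropy(logits, targets, reduction="none")
loss = (per_example_loss * weights).mean()       # weighted loss used for backprop
loss.backward()
```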

I'd also like to mention that the real-world problem is much more complex than this, and it would be intractable to try to artificially rebalance the data via bagging/bootstrapping. This would be a 50+ dimensional task with literally thousands of edge cases to consider.

Best Answer

Sometimes the simplest solutions are the ones we think of last.

I eventually removed the individual features last-neighborhood and before-last-neighborhood and replaced them with a single combined feature: two-last-neighborhoods (previous and last, in order).

This way, the model no longer puts everything with last-neighborhood: times-square in the same bracket, and those who cross Times Square from Wall Street are now (correctly) predicted to be heading somewhere north.
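Concretely, the change boils down to something like the following (a minimal pandas sketch; column names and the separator are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "before_last_neighborhood": ["central park", "central park", "wall street", "wall street"],
    "last_neighborhood":        ["times square", "times square", "times square", "times square"],
})

# Single combined categorical feature: (previous, last) in order
df["two_last_neighborhoods"] = (
    df["before_last_neighborhood"] + " -> " + df["last_neighborhood"]
)

# One-hot encode the combined feature before feeding it to the model
features = pd.get_dummies(df["two_last_neighborhoods"])
print(features.columns.tolist())
# ['central park -> times square', 'wall street -> times square']
```

Because each (previous, last) pair now gets its own category, the model can learn a separate mapping for wall street -> times square without it being drowned out by the much more frequent central park -> times square rows.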