Suppose a model that predicts which location/landmark a walking tourist will visit next, based on two geographical input features:
- the last neighborhood this person has walked through
- the second-to-last neighborhood this person has walked through (i.e., the one before the last)
The distribution of the training data looks as follows:
| before-last-neighborhood | last-neighborhood | next-visited-place | count |
| --- | --- | --- | --- |
| central park | times square | South of times square | 10,000 |
| central park | times square | North of times square | 16 |
| wall street | times square | North of times square | 90 |
| wall street | times square | South of times square | 3 |
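The problem can be seen directly by computing conditional probabilities from these counts. A quick sketch (pure Python, using the numbers from the table above):

```python
from collections import Counter

# Counts taken from the table above: (before-last, last, next, count)
rows = [
    ("central park", "times square", "South of times square", 10_000),
    ("central park", "times square", "North of times square", 16),
    ("wall street",  "times square", "North of times square", 90),
    ("wall street",  "times square", "South of times square", 3),
]

# P(next | last neighborhood only): the "South" signal dominates
by_last = Counter()
for before, last, nxt, n in rows:
    by_last[(last, nxt)] += n
total_last = sum(n for (last, _), n in by_last.items() if last == "times square")
p_south_given_last = by_last[("times square", "South of times square")] / total_last
print(f"P(South | times square) = {p_south_given_last:.3f}")  # ~0.99

# P(next | both neighborhoods): conditioning on provenance flips the answer
by_pair = Counter()
for before, last, nxt, n in rows:
    by_pair[(before, last, nxt)] += n
total_ws = (by_pair[("wall street", "times square", "North of times square")]
            + by_pair[("wall street", "times square", "South of times square")])
p_north_given_ws = by_pair[("wall street", "times square", "North of times square")] / total_ws
print(f"P(North | wall street, times square) = {p_north_given_ws:.3f}")  # ~0.97
```

Marginally, almost everyone crossing Times Square goes South, yet conditioned on coming from Wall Street, almost everyone goes North.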
Because of this imbalance, the model learns a very strong association `times square => somewhere South`, regardless of the neighborhood crossed before it.
However, common sense says that someone walking through Times Square coming from Wall Street is probably heading North, not South. The training data actually reflects this; it just happens that many more people crossed Times Square coming from the North than from the South.
What would be some effective ways to make a model more robust to this phenomenon, so that it effectively learns that e.g. `wall street + times square => going North`?
I've tried different forms of feature engineering: adding more features (e.g. nationality of the tourist, gender, and other similar attributes), and combining (before-last-zone + last-zone) into a single categorical feature, but all of this helps only marginally. The reality is still that most people crossing Times Square are doing so Southbound, and the model insists on predicting South for anyone crossing Times Square, regardless of their provenance.
In a neural network context, would there be a way to somehow assign "more weight" to a particular combination of features, as a way of telling the model that these are the most important features to look at?
I'd also like to mention that the real-world problem is much more complex than this, and it would be intractable to artificially rebalance the data via bagging/bootstrapping: it's a 50+ dimensional task with literally thousands of edge cases to consider.
Best Answer
Sometimes the simplest solutions are the ones we think of last.
I eventually removed the individual features `last-neighborhood` and `before-last-neighborhood`, and replaced them with a single combined feature instead: `two-last-neighborhoods` (before-last and last, in order). This way, the model no longer puts everything that has `last-neighborhood: times-square` in the same bracket, and those who cross Times Square from Wall Street are now (correctly) interpreted as going somewhere North.
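A minimal sketch of the combined-feature idea, using a simple frequency lookup in place of a trained model (the data is expanded from the counts in the question; the `"|"` separator is an arbitrary choice):

```python
from collections import Counter

# Training rows (before_last, last, next_place), expanded from the table's counts
data = (
    [("central park", "times square", "South of times square")] * 10_000
    + [("central park", "times square", "North of times square")] * 16
    + [("wall street", "times square", "North of times square")] * 90
    + [("wall street", "times square", "South of times square")] * 3
)

# Single combined categorical feature: "before-last|last", order preserved
counts = Counter()
for before, last, nxt in data:
    combined = f"{before}|{last}"
    counts[(combined, nxt)] += 1

def predict(before, last):
    """Most frequent next place for this exact two-neighborhood sequence."""
    combined = f"{before}|{last}"
    candidates = {nxt: n for (key, nxt), n in counts.items() if key == combined}
    return max(candidates, key=candidates.get)

print(predict("wall street", "times square"))   # North of times square
print(predict("central park", "times square"))  # South of times square
```

Because the model only ever sees the combined category, the dominant `times square => South` pattern from the Central Park crowd can no longer leak into predictions for the Wall Street crowd.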