Using SMOTE with grouped, paneled, or categorical data

Tags: k nearest neighbour, machine learning, resampling, sampling, unbalanced-classes

Let's say that I am building a classifier on imbalanced data. A sample of the data set looks like:

Person   Time1    Time2    Time3    Injury
A        3        2        3        0
A        3        3.4      1.2      0
A        2        2.1      2.1      1
B        0        2        2        0

etc. I want to use Person, Time1, Time2, and Time3 as features to classify Injury (this is just an example I'm making up). Now let's say that in my target Injury I have value counts of:

Label    Count
0        9000
1        50

I want to use SMOTE to both under-sample the majority class and over-sample the minority class. This is easy enough if I'm only using the numerical variables, but what do I do in this case where I have a grouping variable?

In theory it is fine for any given Person to have multiple positive Injury cases. But how do I set up the SMOTE algorithm so that, when it finds the k nearest neighbours of a data point and generates synthetic points between that point and its neighbours, each synthetic point retains the Person label of the original data point?

Best Answer

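This is a late answer, but SMOTENC is the appropriate method for over-sampling data that mixes categorical and numerical features. Declare the Person column as a categorical feature so that it is never interpolated like a numeric one.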

imblearn.over_sampling.SMOTENC