Machine Learning – SMOTE Assumption on Positive Class Instances

machine-learning, resampling, smote, supervised-learning

I am trying to get my head around this assertion by Liu et al. (2011, p. 7) about the SMOTE oversampling technique:

because SMOTE makes the assumption that the instance between a positive class instance and its nearest neighbors is also positive. However it may not always be true in practice. As a positive instance is very close to the boundary, its nearest neighbor is likely to be negative, and this possibility may increase as K and the imbalance ratio become larger.

Maybe I am wrong, but my understanding is that SMOTE selects the K nearest neighbours from the minority (positive) class only, and generates synthetic samples along the lines joining the minority example and those K same-class neighbours.
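
Here is a minimal sketch of that interpolation step as I understand it (my own illustrative code, not taken from either paper; the use of scikit-learn's NearestNeighbors, the value of k and the random generator are assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, rng=np.random.default_rng(0)):
    """Generate one synthetic sample from minority-class data X_min only."""
    # neighbours are searched among minority samples only
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1 because each point is its own neighbour
    i = rng.integers(len(X_min))                          # random minority seed point
    _, idx = nn.kneighbors(X_min[i:i + 1])
    j = rng.choice(idx[0][1:])                            # one of its k minority neighbours, at random
    lam = rng.uniform()                                   # a single "gap" for the whole feature vector
    return X_min[i] + lam * (X_min[j] - X_min[i])         # point on the segment joining the two
```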

Since these authors claim that it may not always be true, I am confused. Does that mean a majority (negative) example may be selected among the K neighbours? Can anyone explain what this actually means? I cannot understand what exactly the authors mean by the concluding statement of this paragraph.

Reference

Liu, Y., Yu, X., Huang, J.X. and An, A., 2011. Combining integrated sampling with SVM ensembles for learning from imbalanced datasets. Information Processing & Management, 47(4), pp.617-631.

Best Answer

The original paper makes it clear that the nearest neighbours are taken from the minority class, so it appears that Liu et al. are mistaken.

[Image: excerpt from the original SMOTE paper describing the selection of nearest neighbours from the minority class]

Having said which, I am not sure the paper is entirely consistent with most implementations (or indeed with itself):

[Image: the SMOTE pseudocode from the original paper]

I think that for the synthetic point to lie on the line joining the selected minority point with its nearest neighbour, $gap$ would have to be assigned outside the loop over attributes, not inside it, as the sketch below illustrates.
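
To make the distinction concrete, here is a small sketch (my own illustration, not the paper's code): drawing a fresh gap per attribute places the synthetic point anywhere in the axis-aligned box spanned by the two points, whereas a single gap drawn outside the attribute loop keeps it on the joining line segment.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 0.0])          # selected minority point
neighbour = np.array([1.0, 2.0])  # its chosen minority neighbour

# gap drawn inside the loop over attributes (as the pseudocode literally reads):
# the synthetic point falls somewhere in the rectangle spanned by x and neighbour
synthetic_box = np.array([xi + rng.uniform() * (ni - xi) for xi, ni in zip(x, neighbour)])

# gap assigned once, outside the loop: the point lies on the segment from x to neighbour
gap = rng.uniform()
synthetic_line = x + gap * (neighbour - x)
```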

BTW, if you are using a modern classifier, such as an SVM, that can implement cost-sensitive learning and has a good means of avoiding overfitting, then I would advise against using SMOTE. Techniques such as regularisation have a lot of solid theory behind them; the method used by SMOTE has none. I would also advise using a probabilistic classifier for problems with unequal misclassification costs etc., as you can account for the misclassification costs after the model has been fitted to the data.
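
For concreteness, here is a hedged sketch of those two alternatives using scikit-learn; the toy dataset, the class weights and the misclassification costs are made up purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# imbalanced toy data (class proportions are illustrative only)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# cost-sensitive SVM: the misclassification costs enter through the class weights
svm = SVC(class_weight={0: 1.0, 1: 10.0}).fit(X, y)

# probabilistic classifier: fit once, then apply the costs as a threshold afterwards
clf = LogisticRegression(max_iter=1000).fit(X, y)
c_fn, c_fp = 10.0, 1.0                   # hypothetical costs of the two error types
threshold = c_fp / (c_fp + c_fn)         # minimum expected-cost decision threshold
y_pred = (clf.predict_proba(X)[:, 1] >= threshold).astype(int)
```

The point of the second approach is that the costs only affect the decision rule, so you can change them without refitting the probabilistic model.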