Solved – Cross validation and imbalanced learning

classification, cross-validation, machine learning, unbalanced-classes

I was provided with a heavily imbalanced medical dataset (a 90%/10% split between the negative and positive classes) to perform classification.

To mitigate the imbalance, I oversampled the minority class with SMOTE in order to obtain a balanced dataset.
Since I needed cross-validated results, I performed the oversampling only on the training partition, leaving the test partition untouched.
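
For concreteness, the setup looks roughly like the sketch below (assuming imbalanced-learn's SMOTE inside an imblearn pipeline; the logistic-regression classifier and the F1 scoring are just placeholders):

```python
# Sketch: SMOTE is applied only when the pipeline is fit, i.e. only to the
# training folds of each CV split; the test folds are left untouched.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Synthetic stand-in for the medical data: ~90% negatives, ~10% positives.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),            # default: oversample to 1:1
    ("clf", LogisticRegression(max_iter=1000)),  # placeholder classifier
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
print(scores.mean())
```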

The problem is that, since the proportion of positive to negative examples changes between the training and test partitions, the classifier behaves poorly: it effectively learns the class frequencies of the (rebalanced) training set and carries that prior over to the test phase, producing a lot of false positives.

Any idea how I can overcome this problem?

Best Answer

Say the dataset is composed of negative and positive sets $N$ and $P$, respectively, with $|P| = \frac{1}{9} |N|$ in the dataset, but with the real-life ratio being $|P| = \alpha |N|$ for some $\alpha > \frac{1}{9}$ (e.g., $\alpha=1$ means that, in real life, positives and negatives are approximately equally frequent).

Partition the negative samples into two parts, $N_1$, $N_2$, s.t. $|N_2| = \frac{1}{\alpha} |P|$, so that the ratio of $P$ to $N_2$ matches the real-life ratio $\alpha$. (For example, with $|N| = 900$, $|P| = 100$, and $\alpha = \frac{1}{2}$, take $|N_2| = 200$.)

For example, in the following figure, $N_1, N_2, P$ are the parts in blue, cyan, grey, respectively.

[Figure: $N_1$ (blue), $N_2$ (cyan), and $P$ (grey), each divided into three blocks.]

To perform 3-fold CV, for example, partition each of the three parts into 3 blocks. The first fold, say, would use the top two blue blocks and the top two grey blocks for training, and the bottom cyan block and the bottom grey block for testing.

Note that the test set then has the real-life ratio $\alpha$. In the training set, use SMOTE as you're doing now, but oversample to the same ratio $\alpha$ rather than to full balance (the two coincide when $\alpha = 1$).

(Of course, you can adapt this to other methods besides k-fold.)
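
To make the fold construction concrete, here is a rough sketch (assumptions on my part: binary labels with positives encoded as 1, NumPy and imbalanced-learn available; the helper name `alpha_balanced_folds` is made up for illustration, not a standard API):

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def alpha_balanced_folds(y, alpha, n_splits=3, random_state=0):
    """Yield (train_idx, test_idx) pairs following the N1/N2/P scheme above.

    Negatives (y == 0) are split into N1 (train-only) and N2 (test-only),
    with |N2| = |P| / alpha, so each test fold has the real-life ratio alpha.
    """
    rng = np.random.RandomState(random_state)
    pos = rng.permutation(np.flatnonzero(y == 1))   # P
    neg = rng.permutation(np.flatnonzero(y == 0))   # N
    n2_size = int(round(len(pos) / alpha))
    assert n2_size <= len(neg), "alpha is below the dataset's own P/N ratio"
    n2, n1 = neg[:n2_size], neg[n2_size:]
    pos_blocks = np.array_split(pos, n_splits)
    n1_blocks = np.array_split(n1, n_splits)
    n2_blocks = np.array_split(n2, n_splits)
    for k in range(n_splits):
        test_idx = np.concatenate([pos_blocks[k], n2_blocks[k]])
        train_idx = np.concatenate(
            [pos_blocks[j] for j in range(n_splits) if j != k]
            + [n1_blocks[j] for j in range(n_splits) if j != k]
        )
        yield train_idx, test_idx

# In each fold, oversample the training part to the same ratio alpha.
# For binary problems, a float sampling_strategy is the desired
# minority/majority ratio after resampling.
# for train_idx, test_idx in alpha_balanced_folds(y, alpha=0.5):
#     X_tr, y_tr = SMOTE(sampling_strategy=0.5, random_state=0) \
#         .fit_resample(X[train_idx], y[train_idx])
#     ... fit on (X_tr, y_tr), evaluate on (X[test_idx], y[test_idx])
```

The SMOTE call is left as a comment because the appropriate $\alpha$ depends on the true prevalence in your application.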


Unfortunately, for the percentages you mention, the test sets will probably be relatively noisy (unless your dataset is large). Personally, I don't see a way around that.