I was given a heavily imbalanced medical dataset (roughly a 90%/10% split between the negative and positive classes) on which to perform classification.
To mitigate the imbalance, I oversampled the minority class with SMOTE to obtain a balanced dataset.
Since I needed cross-validated results, I performed oversampling only on the training partition, leaving the test partition untouched.
The problem is that the proportion of positive to negative examples now differs between train and test, so the classifier performs poorly: it effectively learns the class frequencies from the training set and carries that prior into the test phase, producing a lot of false positives.
Any idea how I can overcome this problem?
Best Answer
Say the dataset is composed of $N$ and $P$, negative and positive, respectively, with $|P| = \frac{1}{9} |N|$ in the dataset, but with the true-life ratio being $|P| = \alpha |N|$ for some $\alpha > \frac{1}{9}$ (e.g., $\alpha=1$ means that, in real life, positives and negatives are approximately as frequent).
Partition the negative samples into two parts, $N_1$ and $N_2$, s.t. $|N_2| = \frac{1}{\alpha} |P|$ (feasible precisely because $\alpha > \frac{1}{9}$), so that $N_2$ together with $P$ reflects the real-life ratio.
For example, in the following figure, $N_1, N_2, P$ are the parts in blue, cyan, grey, respectively.
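The split above can be sketched as follows. This is a minimal illustration with my own variable names (not from the answer), assuming labels `y` with 0 = negative and 1 = positive, and using $|N_2| = |P|/\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)

def split_negatives(y, alpha):
    """Split negative indices into N1 (train-only) and N2 (test-only),
    with |N2| = |P| / alpha so that N2 + P is alpha-balanced."""
    neg = np.flatnonzero(y == 0)
    pos = np.flatnonzero(y == 1)
    n2_size = int(round(len(pos) / alpha))
    if n2_size > len(neg):
        raise ValueError("alpha too small: not enough negatives to form N2")
    neg = rng.permutation(neg)
    n2, n1 = neg[:n2_size], neg[n2_size:]
    return n1, n2, pos

# Example: 900 negatives, 100 positives, assumed real-life ratio alpha = 1/2
y = np.array([0] * 900 + [1] * 100)
n1, n2, pos = split_negatives(y, alpha=0.5)
# |N2| = 200, so the test pool N2 + P has 100 positives per 200 negatives
```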
To perform 3-fold CV, for example, partition each of the three parts into 3 sub-parts. The first fold, say, would consist of the top 2 blue and top 2 grey sub-parts for train, and the bottom 1 cyan and bottom 1 grey sub-parts for test.
Note that the test set is $\alpha$-balanced by construction. In the train set, you'd use SMOTE to $\alpha$-balance, as you're doing now.
(Of course, you can adapt this to other methods besides k-fold.)
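A minimal sketch of the fold construction above, under my own naming: each of $N_1$, $N_2$, $P$ is split into $k$ sub-parts; each fold trains on $k-1$ sub-parts of $N_1$ and $P$ (oversampled to the $\alpha$ ratio) and tests on the held-out sub-parts of $N_2$ and $P$. The random-duplication oversampling here is only a stand-in for SMOTE:

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha_balanced_folds(n1, n2, pos, alpha, k=3):
    """Yield (train_idx, test_idx) pairs: train from N1 + P with the
    positives oversampled to alpha * |negatives|, test from N2 + P at
    the natural alpha ratio."""
    n1_f = np.array_split(rng.permutation(n1), k)
    n2_f = np.array_split(rng.permutation(n2), k)
    p_f = np.array_split(rng.permutation(pos), k)
    for i in range(k):
        train_neg = np.concatenate([n1_f[j] for j in range(k) if j != i])
        train_pos = np.concatenate([p_f[j] for j in range(k) if j != i])
        # Duplicate positives up to alpha * |train negatives|;
        # in practice, replace this with SMOTE on the feature matrix.
        target = int(round(alpha * len(train_neg)))
        extra = rng.choice(train_pos, size=max(target - len(train_pos), 0))
        train = np.concatenate([train_neg, train_pos, extra])
        test = np.concatenate([n2_f[i], p_f[i]])  # alpha-balanced by design
        yield train, test

# Example: 900 negatives, 100 positives, alpha = 1/2 (so |N2| = |P| / alpha = 200)
y = np.array([0] * 900 + [1] * 100)
neg = np.flatnonzero(y == 0)
pos = np.flatnonzero(y == 1)
n2, n1 = neg[:200], neg[200:]
folds = list(alpha_balanced_folds(n1, n2, pos, alpha=0.5, k=3))
```

Each test fold then has roughly one positive per two negatives, matching the assumed real-life ratio, while the train folds are brought to that same ratio by oversampling.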
Unfortunately, for the percentages you mention, the test estimates will probably be relatively noisy (unless your dataset is large). Personally, I don't see a way around that.