Solved – training approaches for highly-imbalanced data set

bioinformatics, classification, data mining, machine learning, svm

I have a highly imbalanced test data set: the positive set consists of 100 cases and the negative set of 1500 cases (a 1:15 ratio). On the training side, I have a larger candidate pool: the positive training set has 1200 cases and the negative training set has 12000 cases (a 1:10 ratio). For this scenario, I see two options:

1) Train a weighted SVM on the whole training set (P: 1200, N: 12000).

2) Train an SVM on a down-sampled training set (P: 1200, N: 1200), where the 1200 negative cases are sampled from the 12000 available.
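To make the two options concrete, here is a minimal sketch of the data preparation each one implies, using only the Python standard library. The function names `balanced_class_weights` and `undersample_majority` are hypothetical, and the inverse-frequency weighting formula is an assumption (one common heuristic, not something the question prescribes). The SVM fit itself is left out; in scikit-learn, for example, a weight dictionary like this can be passed to `SVC` through its `class_weight` parameter.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Option 1: weight each class by n_samples / (n_classes * class_count).

    Assumption: inverse-frequency weighting; other weighting schemes exist.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def undersample_majority(labels, seed=0):
    """Option 2: keep every case of the smallest class and an equally
    sized random subset of each larger class. Returns sorted row indices."""
    rng = random.Random(seed)
    by_class = {}
    for idx, cls in enumerate(labels):
        by_class.setdefault(cls, []).append(idx)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        if len(indices) == target:
            kept.extend(indices)
        else:
            kept.extend(rng.sample(indices, target))
    return sorted(kept)


# Label vector matching the training pool: 1200 positives, 12000 negatives.
train_labels = [1] * 1200 + [0] * 12000

weights = balanced_class_weights(train_labels)  # option 1
subset = undersample_majority(train_labels)     # option 2
subset_counts = Counter(train_labels[i] for i in subset)

print(weights)
print(subset_counts)
```

With these counts, the heuristic weights the positive class 5.5 and the negative class 0.55, so each class contributes the same total weight, while the down-sampled subset keeps all 1200 positives and discards 10800 of the 12000 negatives. That is the practical trade-off between the two options: weighting uses every negative case, sampling throws most of them away.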

Is there any theoretical guidance on deciding which approach is better? Since the test data set is highly imbalanced, should I use the imbalanced training set as well?

Best Answer

The reply by datapraxis in a recent Reddit post will be of interest.

Edit: the paper mentioned is Haibo He and Edwardo A. Garcia, "Learning from Imbalanced Data," IEEE Transactions on Knowledge and Data Engineering, pp. 1263-1284, September 2009.