Machine Learning – Handling Highly Unbalanced Test Data Set and Balanced Training Data in Classification

classification, data mining, machine learning, svm

I have a training set with about 3000 positive and 3000 negative instances, but my test set is highly unbalanced: it has only 50 positive instances and 1500 negative instances. This makes the precision very low. Are there any approaches to solve this problem? I use an SVM to build the classifier.

Best Answer

This setting is called dataset shift: the training and test data follow different distributions. The PDF in [1] should help you understand several of the underlying issues involved.
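To see why precision drops even when the classifier itself does not change, here is a small arithmetic sketch. The true-positive and false-positive rates (0.9 and 0.1) are hypothetical values picked for illustration, not numbers from the question; only the class counts come from the question.

```python
def precision(tpr, fpr, n_pos, n_neg):
    """Precision implied by fixed per-class rates and given class counts."""
    true_pos = tpr * n_pos    # positives correctly flagged
    false_pos = fpr * n_neg   # negatives wrongly flagged
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier quality (assumed, not taken from the question).
TPR, FPR = 0.9, 0.1

balanced = precision(TPR, FPR, n_pos=3000, n_neg=3000)  # training-like mix
skewed = precision(TPR, FPR, n_pos=50, n_neg=1500)      # test-set mix

print(f"balanced: {balanced:.2f}")  # 0.90
print(f"skewed:   {skewed:.2f}")    # 0.23
```

With the same error rates, the 1500 negatives produce far more false positives than the 50 positives can produce true positives, so precision falls from about 0.90 to about 0.23 purely because of the class mix.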

As a practical step, you can use unconstrained least-squares importance fitting (uLSIF) to estimate importance weights for your training instances from your test set [2]; you need only the test feature vectors, not the test labels. Once you have the importance estimates, you can use them as instance weights in LIBSVM [3].

That should enable you to get a better classifier.
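The importance-estimation step can be sketched roughly as follows. This is a minimal NumPy sketch of the uLSIF closed-form solution, not the reference implementation from [2]: the function names, the fixed `sigma` and `lam` values, and the toy 1-D data are all assumptions for illustration. In practice the kernel width and regularization strength would be chosen by cross-validation.

```python
import numpy as np


def gaussian_kernel(X, centers, sigma):
    """Gaussian kernel matrix between rows of X and rows of centers."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def ulsif_weights(X_train, X_test, sigma=1.0, lam=0.1, n_centers=100, seed=0):
    """Estimate w(x) ~ p_test(x) / p_train(x) at each training point.

    sigma and lam are fixed here for brevity (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n_centers = min(n_centers, len(X_test))
    # Kernel centers are a random subset of the (unlabeled) test points.
    idx = rng.choice(len(X_test), size=n_centers, replace=False)
    centers = X_test[idx]

    phi_train = gaussian_kernel(X_train, centers, sigma)  # (n_train, b)
    phi_test = gaussian_kernel(X_test, centers, sigma)    # (n_test, b)

    H = phi_train.T @ phi_train / len(X_train)  # second moment under train
    h = phi_test.mean(axis=0)                   # first moment under test

    # Regularized least-squares solution, then clip negatives to zero.
    alpha = np.linalg.solve(H + lam * np.eye(n_centers), h)
    alpha = np.maximum(alpha, 0.0)
    return phi_train @ alpha


# Toy data (hypothetical): the test inputs are shifted relative to training.
rng = np.random.default_rng(42)
X_train = rng.normal(0.0, 1.0, size=(500, 1))
X_test = rng.normal(2.0, 0.5, size=(300, 1))

weights = ulsif_weights(X_train, X_test)
print(weights.shape, float(weights.min()) >= 0.0)
```

Training points that look like test points receive larger weights. The resulting vector has one entry per training instance, so it can be written out for the LIBSVM instance-weight tool in [3]. If you happen to use scikit-learn instead (an assumption, the question does not say), it can be passed as `SVC().fit(X_train, y_train, sample_weight=weights)`.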

[1] http://www.acad.bg/ebook/ml/The.MIT.Press.Dataset.Shift.in.Machine.Learning.Feb.2009.eBook-DDU.pdf
[2] http://www.ms.k.u-tokyo.ac.jp/software.html#uLSIF
[3] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances
