Solved – SVM heavily overfits the data (classifying highly unbalanced data)

classification, machine learning, random forest, scipy, svm

I have a huge training set from which I am supposed to both classify and regress: I classify whether an event will occur or not, and a separate task is to regress the intensity of the event in the future.
The problem I am battling with is that there are very few positive instances for classification in my training and test sets (2%, to be precise). As a result, whatever method I try, my precision and recall for the rarer class do not rise above 35% and 10% respectively. I also tried using class weights or sample weights, but to no avail. When I try an SVM using scikit-learn's SVC module, it heavily overfits the data, i.e. it gives more than 90% accuracy for both classes on the training data but 0 precision and 0 recall on the test data. Similarly, in the regression problem, since there are a lot of 0's in the training set, the regressed values do not make any sense at all.

So my question is two-fold: first, what could be the reason for the SVM to overfit the data? And second, what can I use to further increase the precision and recall of the rarer class (I tried a random forest, which gives 62% precision and 55% recall)? I have tried giving sample weights, but it doesn't help (it increases precision to 63% in the random forest but drops recall).

Even giving class 1 a weight of 100, class_weight = {1: 100}, doesn't solve the problem.
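For reference, here is a minimal sketch of the kind of setup described above, using scikit-learn's SVC with a class weight and reporting per-class precision/recall on both splits; the dataset is simulated with roughly 2% positives and all specific values are only illustrative, not the asker's actual data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Simulated data with roughly 2% positives, mimicking the imbalance described above
X, y = make_classification(n_samples=10000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# Class-weighted SVC: up-weight the rare positive class
clf = SVC(kernel='rbf', C=1.0, class_weight={1: 100})
clf.fit(X_train, y_train)

# Report precision/recall per class on both splits to see the train/test gap
print(classification_report(y_train, clf.predict(X_train)))
print(classification_report(y_test, clf.predict(X_test)))
```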

Best Answer

The usual solution to imbalanced data is to use a class-weighted SVM, which has two misclassification penalties, $C_{pos}$ and $C_{neg}$, instead of one. You assign a higher misclassification penalty to the minority class. A common heuristic is to keep the ratio as follows: $$C_{pos} \times n_{pos} = C_{neg} \times n_{neg},$$ where $n_{pos}$ and $n_{neg}$ are the sizes of the positive and negative classes. In scikit-learn you can assign these by scaling $C$ per class (e.g. via the class_weight parameter of SVC).
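As a rough sketch of what that heuristic looks like in scikit-learn (assuming a binary label vector y_train as in the earlier snippet), the ratio can be set by hand or via the built-in class_weight='balanced' shortcut:

```python
import numpy as np
from sklearn.svm import SVC

# Class sizes in the training set
n_pos = np.sum(y_train == 1)
n_neg = np.sum(y_train == 0)

# Heuristic: C_pos * n_pos = C_neg * n_neg, i.e. multiply C for the
# positive class by n_neg / n_pos while leaving the negative class at C
weights = {0: 1.0, 1: n_neg / n_pos}
clf = SVC(kernel='rbf', C=1.0, class_weight=weights)
clf.fit(X_train, y_train)

# Equivalent shortcut: 'balanced' weights each class by
# n_samples / (n_classes * np.bincount(y))
clf_balanced = SVC(kernel='rbf', C=1.0, class_weight='balanced')
clf_balanced.fit(X_train, y_train)
```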

A better approach would be to tune both of these penalties (plus any additional kernel parameters). If you are using Python, you could do this with a tuning library like Optunity (examples with scikit-learn are included on its webpage).
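If you prefer to stay within scikit-learn rather than pulling in Optunity, a grid search over C, gamma and the class-weight ratio is one way to do the same tuning; the sketch below uses GridSearchCV, and the grid values are only an illustrative starting point, not recommended settings.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Jointly tune the misclassification penalty C, the RBF kernel width gamma,
# and the weight given to the rare positive class.
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 0.01, 0.1, 1],
    'class_weight': [{1: w} for w in (1, 10, 50, 100)],
}

# Score on F1 of the positive class rather than accuracy, since accuracy
# is misleading when only ~2% of the samples are positive.
search = GridSearchCV(SVC(kernel='rbf'), param_grid,
                      scoring='f1', cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```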