Solved – Assigning weights to a multilabel SVM to balance classes

Tags: machine-learning, scikit-learn, svm

How is this done? I am using scikit-learn to train an SVM. My classes are unbalanced. Note that my problem is multiclass and multilabel, so I am using OneVsRestClassifier:

from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(y_train)

clf = OneVsRestClassifier(svm.SVC(kernel='rbf'))
clf = clf.fit(x, y)
pred = clf.predict(x_test)

Can I add a sample_weight parameter somewhere to account for the unbalanced classes? If I add a class_weight dict to the SVC, I get the error:

ValueError: Class label 2 not present

This is because I have converted my labels to a binary indicator matrix with the MultiLabelBinarizer. However, if I do not binarize the labels, I get:

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.
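For reference, MultiLabelBinarizer turns label sequences into a 0/1 indicator matrix, one column per class, so the original class labels no longer appear as values in y (a minimal demonstration with made-up labels):

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Two toy samples: the first has labels {1, 2, 3}, the second {2, 4}.
Y = mlb.fit_transform([[1, 2, 3], [2, 4]])
print(mlb.classes_)  # [1 2 3 4]
print(Y)
# [[1 1 1 0]
#  [0 1 0 1]]
```

Each one-vs-rest binary subproblem therefore sees only the labels 0 and 1, which is why a class_weight dict keyed by the original labels (1, 2, 3, ...) raises "Class label 2 not present".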

class_weight is a dict mapping class labels to weights: {1: 1, 2: 1, 3: 3, ...}

Here are the details of x and y:

print(X[0])  
[ 0.76625633  0.63062721  0.01954162 ...,  1.1767817   0.249034    0.23544988]
print(type(X))
'numpy.ndarray'

print(y[0])
[1, 2, 3, 4, 5, 6, 7]  # before binary conversion

print(type(y))
'numpy.ndarray'
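One way to sidestep the label-mapping problem is class_weight='balanced' on the inner SVC: each binary one-vs-rest subproblem is then reweighted by its own 0/1 label frequencies, so no dict keyed by the original labels is needed. A sketch with toy data standing in for x and y_train:

```python
import numpy as np
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

rng = np.random.RandomState(0)
x = rng.randn(20, 5)                         # toy features (stand-in for your x)
y_train = [[1, 2], [2, 3], [1, 3], [3]] * 5  # toy multilabel targets

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(y_train)

# 'balanced' scales weights inversely to class frequencies within each
# binary subproblem, so unbalanced columns are handled automatically.
clf = OneVsRestClassifier(svm.SVC(kernel='rbf', class_weight='balanced'))
clf.fit(x, y)
pred = clf.predict(x)  # shape: (n_samples, n_classes)
```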

Best Answer

This question is related to another one, where a principled way of dealing with this problem is given.

Another possibility that I can think of is to oversample the minority classes by generating synthetic data contained within the convex hull of each class (for separable classes, the maximum-margin hyperplane perpendicularly bisects the shortest segment joining the two classes' convex hulls, so such points do not move the hull).
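The hull-sampling idea above can be sketched with random convex combinations: Dirichlet-distributed weights sum to one and are non-negative, so the weighted average of a class's points always lies inside its convex hull. The function name and the toy minority class below are illustrative, not from the original answer:

```python
import numpy as np

def sample_in_convex_hull(X_class, n_samples, rng=None):
    """Draw points inside the convex hull of X_class as random
    convex combinations (Dirichlet-weighted averages) of its rows."""
    rng = np.random.default_rng(rng)
    # Each row of W is non-negative and sums to 1, so W @ X_class
    # is a convex combination of the class's points.
    W = rng.dirichlet(np.ones(len(X_class)), size=n_samples)
    return W @ X_class

# Toy minority class: three points spanning a triangle in 2-D.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
new_points = sample_in_convex_hull(X_minority, n_samples=100, rng=0)
```

The synthetic rows can then be appended to x (with the minority label) before fitting, which balances the classes without changing the class's convex hull.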