The usual approach is to use Platt's method of fitting a univariate logistic regression model to the output of the SVM. However, if you want a probabilistic output, it is probably better to go for kernel logistic regression, which estimates the probabilities directly, rather than training a discriminative classifier and post-processing the output.
Gaussian process classification is another method that may be better suited; see the excellent book by Rasmussen and Williams, and the equally excellent MATLAB toolbox that goes with it.
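As an aside, scikit-learn exposes Platt's method directly: passing probability=True to SVC fits a sigmoid to the decision values via internal cross-validation. A minimal sketch, on made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data, purely for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# probability=True fits Platt's sigmoid to the SVM scores internally
clf = SVC(kernel='rbf', probability=True).fit(X, y)
proba = clf.predict_proba(X[:5])  # one row per sample, rows sum to 1
```

Keep in mind these probabilities come from post-processing the SVM scores, which is exactly the caveat above.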
The Multi-label algorithm accepts a binary mask over multiple labels. So, for example, you could do something like this:
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

data = [
    [[0.1, 0.6, 0.0, 0.3], 1, 10, 0, 0, 0],
    [[0.7, 0.3, 0.0, 0.0], 0, 7, 22, 0, 0],
    [[0.0, 0.0, 0.6, 0.4], 0, 0, 6, 0, 20],
    # ...
]
X = np.array([d[1:] for d in data])
yvalues = np.array([d[0] for d in data])
# Binarize each row's label set into a 0/1 indicator vector
Y = MultiLabelBinarizer().fit_transform(yvalues)
clf = OneVsRestClassifier(SVC(kernel='poly'))
clf.fit(X, Y)
clf.predict(X)  # in practice, predict on a new X
The result for each prediction will be an array of 0s and 1s marking which class labels apply to each row input sample.
Given your data, though, I'm not sure this is what you want to do. For example, the third point has zero listed twice, which makes me think that you're not predicting multiple labels in an unordered OneVsRest manner, but actually predicting multiple ordered columns of labels: in that case, it might make sense to do a separate classification for each, e.g.
X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])
clfs = [SVC().fit(X, Y[:, i]) for i in range(Y.shape[1])]
Ypred = np.array([clf.predict(X) for clf in clfs]).T
With other classifiers, such as RandomForestClassifier, you can do this column-by-column prediction in one operation, e.g.
from sklearn.ensemble import RandomForestClassifier
X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])
RandomForestClassifier().fit(X, Y).predict(X)
Of course, the array passed to predict should generally be different from the array passed to fit, but hopefully this makes the distinction clear.
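As a further option (not in the original code above): scikit-learn's MultiOutputClassifier wraps any base estimator and does the same per-column fitting in one object. A sketch on made-up data of similar shape:

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

# Made-up data: 5 features, 4 label columns
rng = np.random.RandomState(0)
X = rng.rand(20, 5)
Y = rng.randint(0, 3, size=(20, 4))

# Fits one SVC per output column, like the explicit loop over Y.shape[1]
multi = MultiOutputClassifier(SVC()).fit(X, Y)
Ypred = multi.predict(X)  # shape (20, 4), one prediction per column
```

This gives SVC the same column-wise convenience that RandomForestClassifier has natively.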
Best Answer
A likely cause is that you are not tuning your model: you need to find good values for $C$ and $\gamma$. In your case the defaults turn out to be bad, which leads to trivial models that always predict a certain class. This is particularly common when one class has many more instances than the others. What is your class distribution?
scikit-learn has limited hyperparameter search facilities, but you can use it together with a tuning library like Optunity. An example of tuning scikit-learn SVC with Optunity is available here.
Disclaimer: I am the lead developer of Optunity.
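For reference, a minimal tuning sketch using scikit-learn's own GridSearchCV (toy imbalanced data assumed; Optunity's API differs):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy imbalanced data: 90 samples of one class, 10 of the other
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(3, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

# Log-spaced grid over C and gamma; cross-validation is stratified by default
grid = GridSearchCV(
    SVC(kernel='rbf'),
    param_grid={'C': 10.0 ** np.arange(-2, 3),
                'gamma': 10.0 ** np.arange(-3, 2)},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)  # tuned values of C and gamma
```

Even this coarse grid usually beats the defaults; with heavy imbalance, also consider class_weight='balanced' on the SVC.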