Solved – Lower classification rate than expected by chance

classification, cross-validation, error, machine-learning, svm

I'm using scikit-learn for a small-sample (n = 36) classification problem with three features and three outputs (one output is binary, the other two are ternary).

I'm training a separate classifier for each output. I get reasonable classification rates on two of the outputs, but on the remaining one, a 3-class problem, the rate is abysmal.

I've tried many different classifiers (SVM, random forest, logistic regression, naive Bayes); under leave-one-out cross-validation the classification rate fluctuates between 0% and 33%, and it only reaches 33% when the classifier degenerates into always outputting the same class.
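For reference, my evaluation loop looks roughly like this (the random features and labels here are just stand-ins for my actual data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Stand-ins for my data: 36 samples, 3 features, one ternary output.
rng = np.random.default_rng(0)
X = rng.normal(size=(36, 3))
y = rng.integers(0, 3, size=36)

# Leave-one-out accuracy for each candidate classifier.
for clf in [SVC(), RandomForestClassifier(), LogisticRegression(), GaussianNB()]:
    scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
    print(type(clf).__name__, scores.mean())
```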

What I don't understand is how it's possible for a classifier to achieve a rate lower than expected by chance. A thought experiment: if it achieves a 0% classification rate, I could ignore the predicted class and choose randomly between the other two, for an expected 50% classification rate, which would be better than chance.

Why is this not happening automatically?
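To make the arithmetic concrete, here is a quick simulation of the thought experiment (synthetic labels, not my data): a classifier that is always wrong on a 3-class problem can be turned into a ~50% classifier by picking randomly between the two classes it did not predict.

```python
import numpy as np

rng = np.random.default_rng(0)
classes = np.arange(3)
y_true = rng.integers(0, 3, size=10_000)

# Simulate a 0%-accurate classifier: it always predicts a wrong class.
y_pred = (y_true + 1) % 3

# Ignore the prediction and pick uniformly from the other two classes.
anti = np.array([rng.choice(classes[classes != p]) for p in y_pred])
print((anti == y_true).mean())  # ~0.5, well above the 1/3 chance rate
```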

Best Answer

Check whether your leave-one-out procedure leaves out a single sample each time or a triplet of samples (one from each class). If it leaves out only a single sample, then the held-out sample's class is always slightly underrepresented in the training set, so the training set is biased against the test sample. That can explain below-chance performance.

For example, consider a classifier that, instead of looking at the data, simply outputs the most common label of the training set: with balanced classes, that would achieve a 0% classification rate in a leave-one-sample-out scheme, because the held-out class is always the training minority. You are of course not using such a classifier, but classifiers like SVM do have some susceptibility to this kind of bias.
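You can check this directly with scikit-learn's DummyClassifier. A minimal sketch, assuming a perfectly balanced 3-class dataset of 36 samples:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Balanced toy data: 36 samples, 3 features, 12 samples per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(36, 3))
y = np.repeat([0, 1, 2], 12)

# Predicts the most frequent training label, ignoring the features.
clf = DummyClassifier(strategy="most_frequent")

scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(scores.mean())  # 0.0: the left-out class has 11 training samples,
                      # the other two have 12, so it is never predicted
```

Leaving out one sample per class instead keeps the training set balanced, which removes this particular source of below-chance performance.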