Solved – SVM predicts everything in one class

e1071, feature selection, machine learning, scikit-learn, svm

I'm running a basic language classification task. There are two classes (0/1), and they are roughly evenly balanced (689/776). So far, I've only built basic unigram language models and used these as the features. The document-term matrix has 125k terms before any reduction; I've reduced this to ~1,250 terms that occur in more than 20% of all documents.
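For reference, a minimal sketch of that reduction step, assuming the document-term matrix was built with the tm package (the corpus and labels here are toy stand-ins; the post doesn't show this part):

library(tm)

# Toy stand-in corpus; the real one has ~1465 documents and 125k terms
docs <- c("the cat sat", "the dog sat", "a cat ran", "the dog ran")
corp <- VCorpus(VectorSource(docs))
dtm  <- DocumentTermMatrix(corp)

# Drop terms with sparsity above 0.8, i.e. keep terms that appear
# in more than 20% of all documents
dtm.small <- removeSparseTerms(dtm, sparse = 0.8)

# Data frame with the class labels attached, as used below
df.dtm      <- as.data.frame(as.matrix(dtm.small))
df.dtm$labs <- factor(c(0, 1, 0, 1))  # hypothetical 0/1 labels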

Training on this dataset gives me my best-performing model to date:

library(e1071)

# Hold out a random third of the documents as a test set
index     <- 1:nrow(df.dtm)
testindex <- sample(index, trunc(length(index) / 3))
testset   <- df.dtm[testindex, ]
trainset  <- df.dtm[-testindex, ]

# Inverse-frequency class weights to offset the mild imbalance
wts <- 100 / table(trainset$labs)

# Grid search over cost and gamma (RBF kernel by default)
tune.out <- tune(svm, labs ~ ., data = trainset, class.weights = wts,
                 ranges = list(cost  = c(0.001, 0.01, 0.1, 1, 5, 10, 100),
                               gamma = c(0.005, 0.01, 0.015, 0.02, 0.03, 0.04, 0.05)))
bestmod <- tune.out$best.model

ypred <- predict(bestmod, testset)
table(predicted = ypred, truth = testset$labs)

         truth
predicted   0   1
        0  36  29
        1 200 223

As you can see, performance is not good: overall accuracy is only (36 + 223)/488 ≈ 53%, and recall on class 0 is 36/236 ≈ 15%. But at least it's predicting some documents in the 0 class! In the majority of models I've run so far, performance looks quite a bit worse than this. For instance, here is the exact same setup, but using tf-idf weighting instead of raw term frequency:

         truth
predicted   0   1
        0   1   0
        1 236 251

This is more typical of the models I've run. Furthermore, I've had the same results in Python using scikit-learn.
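The tf-idf variant only changes the weighting used when building the document-term matrix; with tm (again an assumption, since that code isn't shown, and with corp as in the toy sketch above) it amounts to one extra argument:

library(tm)

# tf-idf weighting instead of raw term frequency; everything else
# in the pipeline stays the same
dtm.tfidf <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))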

I thought maybe there was something fishy with some of the features, so I decided to try taking random subsets of the features and fitting models to those (the subsetting itself is sketched after the table below). Here's what happens when I select a random 10% of the features and run the same model:

         truth
predicted   0   1
        0 116 123
        1 106 143
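The subsetting mentioned above was plain column sampling, roughly like this (the post doesn't show the exact code):

# Keep a random 10% of the feature columns, holding the labels aside
feat.cols <- setdiff(names(df.dtm), "labs")
keep      <- sample(feat.cols, trunc(length(feat.cols) / 10))
df.sub    <- df.dtm[, c(keep, "labs")]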

So okay, performance isn't great, but at least I'm getting some predictions in the 0 class. Why are the predictions so strongly weighted toward one class when I include all of the features?

Is this expected behavior due to poor (or really, no) feature selection? I would have expected classification to look more like a coin flip in that case, not a strong weighting toward one class…

Best Answer

Interesting… this is hard to answer directly. Two things I would try in order to diagnose the problem:

1) How do logistic regression and random forest fare?

2) By "fare", I mean look at the calibration of the classifiers: what do the bins of predicted probabilities look like against the observed class frequencies? Binarized posterior class probabilities will not be very helpful here.
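A minimal sketch of both checks, assuming the same trainset/testset split as in the question (randomForest is an extra package; note that with ~1,250 features a plain glm will likely warn about a rank-deficient fit, in which case a regularized fit via glmnet would be sturdier):

library(randomForest)

# 1) Two baseline classifiers on the same split
lr <- glm(labs ~ ., data = trainset, family = binomial)
rf <- randomForest(labs ~ ., data = trainset)

# Continuous posterior probabilities for class 1, not hard labels
p.lr <- predict(lr, testset, type = "response")
p.rf <- predict(rf, testset, type = "prob")[, "1"]

# 2) Crude calibration check: bin the predictions and compare each
#    bin's mean predicted probability with the observed rate of class 1
bins <- cut(p.rf, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
tapply(p.rf, bins, mean)                                    # predicted
tapply(as.numeric(as.character(testset$labs)), bins, mean)  # observed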