Solved – SVM with unequal group sizes in training data

machine-learning, r, svm

I am trying to build an SVM from training data where one group is represented more heavily than the other. However, the groups will be equally represented in the eventual test data. Therefore, I'd like to use the class.weights parameter of the e1071 R package (an interface to libsvm) to balance the influence of the two groups in the training data.

Since I was unsure exactly how these weights should be specified, I set up a little test:

  1. Generate some null data (random features; 2:1 ratio between group labels)
  2. Fit an svm with the class.weights parameter set.
  3. Predict a bunch of new null datasets and look at the class proportions.
  4. Replicate the whole process many times for different null training sets.

Here is the R code I'm using:

nullSVM <- function(n.var, n.obs) {
    # Simulate null training data
    vars   = matrix(rnorm(n.var*n.obs), nrow=n.obs)
    labels = rep(c('a', 'a', 'b'), length.out=n.obs)
    data   = data.frame(group=labels, vars)

    # Fit SVM
    fit = svm(group ~ ., data=data, class.weights=c(a=0.5, b=1))

    # Calculate the average fraction of 'a' we would predict from null test data
    mean(replicate(50, {
        test = data.frame(matrix(rnorm(n.var*n.obs), nrow=n.obs))
        table(predict(fit, test))[1] / n.obs
    }))
}

library(e1071)
set.seed(12345)
mean(replicate(50, nullSVM(50, 300)))

From all this I was expecting an output of ~0.5; however, that's not what I got:

> mean(replicate(50, nullSVM(50, 300)))
[1] 0.6429987

The class.weights parameter is working, sort of: the lower I weight a, the less often it is predicted in this simulation (and if I omit class.weights entirely, the predicted fraction of a is close to 1). But I do not understand why simply using weights of 1:2 (for training data that is 2:1) does not get me all the way down to 50%.
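
For what it's worth, a quick way to see this effect is to expose the weight on a as an argument and sweep it. This wrapper, nullSVMw, is my own illustration rather than part of the original test:

nullSVMw <- function(n.var, n.obs, w.a) {
    # Same null simulation as nullSVM, but with the weight on class 'a' exposed
    vars   = matrix(rnorm(n.var*n.obs), nrow=n.obs)
    labels = rep(c('a', 'a', 'b'), length.out=n.obs)
    data   = data.frame(group=labels, vars)
    fit    = svm(group ~ ., data=data, class.weights=c(a=w.a, b=1))
    mean(replicate(50, {
        test = data.frame(matrix(rnorm(n.var*n.obs), nrow=n.obs))
        table(predict(fit, test))[1] / n.obs
    }))
}

# Average predicted fraction of 'a' for several weight settings on class 'a'
sapply(c(0.25, 0.5, 0.75, 1),
       function(w) mean(replicate(20, nullSVMw(50, 300, w))))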

If I'm misunderstanding SVMs, can someone explain this point? (or send some refs?)

If I'm doing it wrong, can someone tell me the correct way to use the class.weights parameter?

Could it possibly be a bug? (I think not, since I understand this software and the underlying libsvm to be quite mature)

Best Answer

I think it may depend on the values of C and the number of patterns you have. The SVM tries to find the maximum-margin discriminant, so if you have sparse data it is possible that the SVM finds the hard-margin solution without any of the Lagrange multipliers reaching their upper bounds (in which case the ratio of penalties for each class is essentially irrelevant, as the slack variables are small or zero). Try increasing the number of training patterns and see if that has an effect, as that makes it less likely that the hard-margin solution can be found within the box constraints.
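
As a rough check of this suggestion (my own sketch, simply reusing the nullSVM function from the question with larger training sets):

# Does the predicted fraction of 'a' move toward 0.5 as the training set grows?
sapply(c(300, 1000, 3000), function(n) mean(replicate(20, nullSVM(50, n))))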

More importantly, the optimal values of C are data-dependent; you can't just set them to pre-determined values, but should instead optimise them by minimising the leave-one-out error or some generalisation bound. If you have imbalanced classes, you can fix the ratio of penalty values between the two classes and optimise the average penalty over all patterns.
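
A minimal sketch of how this might look with e1071 (my illustration, not the answer author's code; the cost grid 2^(-4:8) is an arbitrary choice, and the default tune.control() uses 10-fold cross-validation rather than leave-one-out; tune.control(cross = nrow(train)) would give leave-one-out):

library(e1071)

# Imbalanced training data, as in the question (2:1 ratio of 'a' to 'b')
vars   <- matrix(rnorm(50 * 300), nrow = 300)
labels <- rep(c('a', 'a', 'b'), length.out = 300)
train  <- data.frame(group = labels, vars)

# Keep the 1:2 penalty ratio fixed via class.weights and tune only the cost C
tuned <- tune(svm, group ~ ., data = train,
              ranges = list(cost = 2^(-4:8)),
              class.weights = c(a = 0.5, b = 1))
tuned$best.parameters   # selected value of C
fit <- tuned$best.model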