Solved – Regularized logistic regression and support vector machine

Tags: classification, libsvm, machine learning, svm

L2-regularized logistic regression and the L2-regularized support vector machine differ in their loss functions. Are there deeper differences between these two models? I tried several data sets and found that L2-regularized logistic regression was always better than the L2-regularized support vector machine. Are there any references or studies discussing the differences between them in terms of classification performance?

Best Answer

The no "no free lunch" theorems suggest that there is no a-priori superiority of one classifier over another, and which works best depends on the nature of the learning task. So for some datasets the SVM will give better performance, and for others it will be regularised logistic regression.

The real distinction between the two classifiers is that the SVM is purely discriminative (designed to give a binary classification), whereas RLR provides estimates of the a-posteriori probability of class membership. So which you use really depends on what you want to know. Prof. Vapnik suggests that we should always aim to solve a problem directly, rather than solving a more general problem and then simplifying the result to answer our particular question. Following this maxim, if we want a binary classifier we should aim to identify the decision boundary directly, rather than estimate the probability of class membership and then threshold it at p = 0.5. The reason is that fitting a model is always a compromise between use of resources and performance: if you are only interested in the decision boundary, then you should spend all your resources optimising the accuracy of the classifier near the decision boundary, rather than on ensuring the classifier gives a probability of 0.9 instead of 0.8 for some example that is reliably classified correctly. I agree with this intuition, and I tend to use (kernel) logistic regression quite a lot, as I often want the probability of class membership, for example because class frequencies differ between operation and the training set, or because misclassification costs are unknown or variable.
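As a concrete example of the class-frequency point: if the operational class frequencies differ from those in the training set, posterior probabilities can be corrected by re-weighting them with the ratio of the new priors to the old ones and renormalising, via Bayes' rule; a hard SVM decision offers no such handle. A minimal sketch (the function and the priors are illustrative, not from any particular library):

```python
import numpy as np

def adjust_posteriors(p_train, train_priors, op_priors):
    """Re-weight posterior probabilities p(y|x) estimated under the
    training-set class frequencies so they reflect the operational
    class frequencies, then renormalise.

    p_train:      (n_samples, n_classes) probabilities from the model
    train_priors: (n_classes,) class frequencies in the training set
    op_priors:    (n_classes,) class frequencies in operation
    """
    w = np.asarray(op_priors) / np.asarray(train_priors)
    p = p_train * w                       # Bayes-rule re-weighting
    return p / p.sum(axis=1, keepdims=True)

# Example: model trained on balanced data, deployed where positives are rare.
p = np.array([[0.3, 0.7]])               # p(y|x) under 50/50 training priors
print(adjust_posteriors(p, [0.5, 0.5], [0.9, 0.1]))
```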

Note that the SVM is an approximate implementation of a "worst case" bound on generalisation performance, whereas RLR is based more on "average case" considerations. So it might be argued that the SVM is likely to be better in pathological cases, but perhaps at the expense of how well it works on more benign datasets (compared with e.g. RLR).

I think the main reason the SVM outperformed the previous generation of learning methods is that it forces the user to think about regularising the model properly. As RLR also incorporates regularisation, I suspect the practical differences in performance are likely to be quite small, provided a reliable model selection procedure is adopted.
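By way of illustration of such a model selection procedure (my sketch, not anything prescriptive): the regularisation parameter of either model can be tuned by cross-validation, e.g. with scikit-learn's GridSearchCV. The grid values below are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# Tune the regularisation parameter C (and the RBF kernel width for the SVM)
# by cross-validation; with both models properly regularised, the performance
# gap is often small.
grids = {
    "RLR": GridSearchCV(LogisticRegression(),
                        {"C": [0.01, 0.1, 1, 10, 100]}, cv=5),
    "SVM": GridSearchCV(SVC(kernel="rbf"),
                        {"C": [0.01, 0.1, 1, 10, 100],
                         "gamma": [0.001, 0.01, 0.1, 1]}, cv=5),
}
for name, gs in grids.items():
    gs.fit(X, y)
    print(f"{name}: best params {gs.best_params_}, CV accuracy {gs.best_score_:.3f}")
```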

I disagree with Marc Claesen somewhat: sparse KLR algorithms are available that are pretty efficient (such as the "Informative Vector Machine"), and the efficiency of the SVM is often overstated. Often the best kernel and regularisation parameters for an SVM give a dense solution, where almost all of the data points are support vectors, in which case the training procedure can be very slow. I generally think it is better to get a good answer slowly than a less satisfactory answer quickly, so I wouldn't put computational efficiency ahead of more basic issues, such as whether you want probabilities of class membership or not.
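The density of the solution is easy to inspect: a fitted scikit-learn SVC exposes its support vectors, so one can check what fraction of the training set they make up. A quick sketch with illustrative parameter values and a deliberately noisy synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# flip_y adds label noise so the classes overlap.
X, y = make_classification(n_samples=400, n_features=5, flip_y=0.2,
                           random_state=2)

# With overlapping classes, a small C (heavy regularisation) tends to make
# almost every training point a support vector, so prediction cost scales
# with the training set size rather than staying sparse.
for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="rbf", C=C).fit(X, y)
    frac = svm.n_support_.sum() / len(X)
    print(f"C={C}: {frac:.0%} of training points are support vectors")
```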

BTW, Gaussian Processes are essentially a Bayesian form of RLR, and are a very interesting family of machine learning algorithms.
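For what it's worth, a minimal sketch of one, assuming scikit-learn's GaussianProcessClassifier with an RBF kernel; like RLR, it returns posterior probabilities of class membership:

```python
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

X, y = make_classification(n_samples=200, n_features=4, random_state=3)

# A GP classifier with an RBF kernel; like RLR it yields posterior
# probabilities of class membership rather than just a hard decision.
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)).fit(X, y)
print(gpc.predict_proba(X[:3]))
```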