Solved – Regarding redundant training data in building SVM-based classifier

classification, data mining, kernel trick, machine learning, svm

To build an SVM-based classifier, I have a training data set consisting of N data points. Some of them are redundant: for instance, 50 of the data points are exactly the same, and another 100 data points are exactly the same as one another. I have two choices: remove the redundant points and train on the reduced data set, or keep the original data set. Will the resulting classifier differ between these two choices?

Best Answer

If you are using hard margins (which requires the data to be separable), there is no difference: the best margin is the same either way.
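
To see why, it may help to write the hard-margin problem out. The notation here is an assumption (the standard primal form, with labels $y_i \in \{-1, +1\}$), not something stated in the question:

$$\min_{w,\,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1, \qquad i = 1, \dots, N.$$

A duplicated point only repeats a constraint that is already present, so the feasible set, and therefore the minimizer, is unchanged.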

If you are using soft margins, then duplicating a data point can matter: the penalty is a sum over the data points that fall inside the margin (or on the wrong side of the boundary), so every extra copy of such a point increases the penalty.
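
As a sketch, assuming the usual hinge-loss form of the soft-margin objective with labels $y_i \in \{-1, +1\}$ and penalty parameter $C$ (this notation is an assumption, not taken from the question):

$$\min_{w,\,b}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \max\bigl(0,\ 1 - y_i\,(w^\top x_i + b)\bigr).$$

If a point $(x_j, y_j)$ occurs $n_j$ times, its $n_j$ identical terms collapse into

$$C\, n_j \max\bigl(0,\ 1 - y_j\,(w^\top x_j + b)\bigr),$$

so keeping the duplicates behaves like giving that point the penalty weight $C\,n_j$, while removing them drops its weight back to $C$. The two objectives differ only when such a point has nonzero hinge loss, which is why the duplication matters only for points inside the margin.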

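Here are $1$-dimensional pictures showing what the best soft-margin classifiers might look like without duplication (first) and with duplication (second). X and O mark the two classes and | marks the decision boundary.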

$XXX~~~~~~~~~~X~|~~~~~~~~~~~OOOO$

$XXX~~~~~~XXX~~~~~|~~~~~OOOO$
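
The effect in the pictures can be reproduced numerically. The following is a minimal sketch, not how SVM libraries actually solve the problem: it assumes the hinge-loss soft-margin objective, and the coordinates, `C = 0.05`, and all function names are hypothetical choices made for illustration. It relies on a trick specific to one dimension: for a fixed slope `w` the objective is convex and piecewise linear in `b`, so a best `b` sits at a kink of one of the hinge terms, and the remaining one-variable problem in `w` is convex and can be handled by ternary search.

```python
def objective(w, b, points, C):
    """Soft-margin primal: 0.5*w^2 + C * (sum of hinge losses)."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in points)
    return 0.5 * w * w + C * hinge


def best_bias(w, points, C):
    """For fixed w the objective is convex and piecewise linear in b,
    so a minimizer lies at a kink b = y - w*x of some hinge term."""
    candidates = [y - w * x for x, y in points]
    return min(candidates, key=lambda b: objective(w, b, points, C))


def fit_1d_svm(points, C, w_max=5.0, iters=200):
    """Ternary search over w in [0, w_max].

    Assumes (for this illustration only) that the +1 class lies to the
    right of the -1 class, so the optimal w is non-negative.
    """
    lo, hi = 0.0, w_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        f1 = objective(m1, best_bias(m1, points, C), points, C)
        f2 = objective(m2, best_bias(m2, points, C), points, C)
        if f1 <= f2:
            hi = m2
        else:
            lo = m1
    w = 0.5 * (lo + hi)
    return w, best_bias(w, points, C)


# Hypothetical 1-D data: X = class -1, O = class +1.
C = 0.05
x_bulk = [(0.0, -1), (1.0, -1), (2.0, -1)]
x_outlier = (6.0, -1)  # the X closest to the O's
o_points = [(10.0, +1), (11.0, +1), (12.0, +1), (13.0, +1)]

single = x_bulk + [x_outlier] + o_points
tripled = x_bulk + [x_outlier] * 3 + o_points  # same point, three copies

w1, b1 = fit_1d_svm(single, C)
w3, b3 = fit_1d_svm(tripled, C)

# The decision boundary is where w*x + b = 0.
boundary_single = -b1 / w1
boundary_tripled = -b3 / w3
print("boundary without duplication:", round(boundary_single, 3))
print("boundary with duplication:   ", round(boundary_tripled, 3))
```

With these assumed values the boundary should sit at about 6.5 without duplication and move to about 8.5 once the X at 6 is tripled, that is, away from the duplicated point, as in the second picture. A larger `C` or different coordinates can shrink or remove the gap.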