Solved – SVM Classification with Duplicate Training Instances

classification, machine-learning, svm, text-mining

I'm using an SVM with a linear kernel for binary sentence classification. My dataset contains many duplicate instances, i.e., many sentences in the training set have identical feature vectors. In the trained model, I see a lot of duplicate support vectors, each with a different Lagrange multiplier (alpha*y) term. For example:

-0.064019597363882741 5660:1 # fcda5cf6-e00b-11e2-ddad-001517e11bd0
-0.00028270965981203044 5660:1 # 67d4d4d2-855c-11e3-f88e-001517e11900
-0.00084391434273911603 5660:1 # 38d53d84-e435-11e2-ddad-001517e11bd0
-0.031641060077848393 5660:1 # 069c2c6e-e8be-11e2-edb1-001517e116e4
-0.0028105012661461251 5660:1 # 623c52e4-e038-11e2-edb1-001517e116e4
-0.00098570725949942428 5660:1 # f67aae78-e020-11e2-5884-001517e11900

For those unfamiliar with the SVMLight model format, each of the above lines represents a support vector, where:

  • the 1st term is alpha*y;
  • the key-value pairs ('5660:1') are feature_number:feature_value, so all of the above training instances contain just one feature, with value 1;
  • everything following '#' is a comment (instance ids, in my case); see the aggregation sketch after this list.
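
Since the decision function is f(x) = sum_i (alpha_i * y_i) K(x_i, x) + b, duplicate support vectors can be collapsed by summing their alpha*y terms. A minimal sketch (hypothetical parsing code of mine, not anything shipped with SVMLight) that aggregates the lines above:

    from collections import defaultdict

    sv_lines = [
        "-0.064019597363882741 5660:1 # fcda5cf6-e00b-11e2-ddad-001517e11bd0",
        "-0.00028270965981203044 5660:1 # 67d4d4d2-855c-11e3-f88e-001517e11900",
        "-0.00084391434273911603 5660:1 # 38d53d84-e435-11e2-ddad-001517e11bd0",
    ]

    alpha_by_vector = defaultdict(float)
    for line in sv_lines:
        body = line.split('#', 1)[0].strip()   # drop the comment (instance id)
        alpha_y, *features = body.split()
        # the sorted feature string is a canonical key for the feature vector
        alpha_by_vector[tuple(sorted(features))] += float(alpha_y)

    for feats, total in alpha_by_vector.items():
        print(feats, total)                    # one summed alpha*y per unique vector

The collapsed model scores sentences identically to the original one, since the decision function depends only on these sums.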

Why do the above support vectors have different alpha*y values, even though they have identical feature vectors? Also, why does the model contain multiple SVs with the same feature vector in the first place? I had expected all SVs to have unique feature vectors, with the duplicated ones carrying a larger alpha value. I agree that duplicating training instances might shift the margin if the duplicates fall within the (soft) margin, but I can't understand why each duplicate support vector would have a different alpha*y value.

Best Answer

Just to expand on Justin's answer: in this case the Lagrange multipliers are not unique; there are infinitely many combinations of Lagrange multipliers that are optimal and produce exactly the same SVM decision function. Because the constraints are linear, the Karush-Kuhn-Tucker (KKT) conditions apply whether or not some of the constraints are redundant (i.e., not of full rank). Identical feature vectors contribute identical columns to the kernel matrix, so the dual objective and the decision function depend only on the sum of their alpha*y terms; how that sum is split among the duplicates is arbitrary, and the solver simply returns one of the optimal splits. If you used a different optimization solver, you could get other, still optimal, values of the Lagrange multipliers.
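
To make this concrete, here is a short sketch using scikit-learn's SVC (a libsvm-based solver; the toy dataset and the choice to duplicate one instance five times are my illustrative assumptions, not the asker's data). It shows that, while the individual dual coefficients assigned to duplicated points are solver-dependent, regrouping them by unique feature vector reproduces the same decision function:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = rng.randn(40, 2)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Duplicate the instance closest to the separating line five times,
    # mimicking the duplicated sentences in the question.
    i = np.argmin(np.abs(X.sum(axis=1)))
    X_dup = np.vstack([X, np.tile(X[i], (5, 1))])
    y_dup = np.concatenate([y, np.repeat(y[i], 5)])

    clf = SVC(kernel='linear', C=1.0).fit(X_dup, y_dup)

    sv = clf.support_vectors_
    coef = clf.dual_coef_.ravel()          # these are the alpha*y terms

    # Depending on the solver's path, the duplicated rows may each appear
    # as a separate support vector with its own share of the total alpha*y.
    dup_mask = np.all(np.isclose(sv, X[i]), axis=1)
    print(coef[dup_mask])

    # Only the per-unique-vector sum matters: regrouping the terms
    # reproduces the weight vector w = sum_i (alpha_i * y_i) * x_i exactly.
    uniq, inverse = np.unique(sv, axis=0, return_inverse=True)
    summed = np.zeros(len(uniq))
    np.add.at(summed, inverse.ravel(), coef)
    print(np.allclose(summed @ uniq, clf.coef_.ravel()))   # True

Any redistribution of alpha*y among identical support vectors with the same label that respects the box constraints 0 <= alpha_i <= C yields the same w and b, which is exactly the non-uniqueness described above.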