Solved – Practical Question about the Assumptions of Support Vector Machines

assumptions · data mining · svm

As far as I know, the only assumption of support vector machines is that the data are independent and identically distributed. I am planning to train and run an SVM on a number of variables that aren't naturally on the same scale. To put the variables on a comparable scale, I was planning on standardizing them; however, I'm not sure whether I should do this for the training set and test set individually, or for the overall sample prior to splitting it into the two sets.

It seems to me that it would be better to standardize the training set and test set individually, but I have no evidence to back this up and no citation to point to. Does anyone know if this is true? Also, is there a citation on this topic?

Best Answer

The correct procedure is to scale the data separately in the following way:

  1. Divide training and test data.
  2. For the training data, center and scale the data. Retain the values of the centering and scaling.
  3. Using the values from (2), subtract the center from the test data and divide by the scale.
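A minimal sketch of these three steps, using NumPy and made-up data (the array sizes and random values are purely illustrative):

```python
import numpy as np

# Hypothetical data: 50 training samples, 10 test samples, 2 features
# on very different scales, standing in for a real train/test split (step 1).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[10.0, 500.0], scale=[2.0, 80.0], size=(50, 2))
X_test = rng.normal(loc=[10.0, 500.0], scale=[2.0, 80.0], size=(10, 2))

# Step 2: compute the center and scale from the TRAINING data only,
# and retain them for later use.
center = X_train.mean(axis=0)
scale = X_train.std(axis=0)
X_train_scaled = (X_train - center) / scale

# Step 3: apply the SAME training-set center and scale to the test data.
# The test set is never used to compute these statistics.
X_test_scaled = (X_test - center) / scale
```

Note that `X_test_scaled` will generally not have exactly zero mean and unit variance; only the training set does, and that is the intended behavior.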

A reference for this in the context of support vector machines is "A Practical Guide to Support Vector Classification" by Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, Department of Computer Science, National Taiwan University.

The reason for this is that we do not want information from the test set to leak into the training procedure: if the scaling parameters are computed on the pooled data, the test set has influenced how the model sees its inputs, and the resulting performance estimate is optimistically biased. This applies to any machine learning procedure, not just SVMs.