Solved – Support Vector Regression and Data Rescaling

machine learning, multidimensional scaling, normalization, svm

I am currently working on Support Vector Regression and I've read that it is recommended to rescale the data, e.g. to the interval $[-1, 1]$, to obtain better results.

My first question is: should rescaling be applied only to features/variables, or also to the response vector?

My second question concerns rescaling and training/test sets. In http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf, page 4, rescaling is done separately on the training and test set, i.e. we first split the data and then rescale each set (using the same method for both). But would it be correct to first rescale the whole data set and then split it?

Best Answer

Question 1

SVM is a binary classifier, so its labels can simply be encoded as 0 or 1. For SVR, it is not necessary to normalize the response vector, but the features/variables must be normalized. The reason is that the C value, which controls model complexity, acts over the scale of the features/variables.
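As a minimal sketch of this point (toy data invented for illustration), the feature columns can be min-max rescaled to $[-1, 1]$ while the response vector is left on its original scale:

```python
import numpy as np

# Hypothetical toy data: 5 samples, 2 features, and a response vector y.
X = np.array([[1.0,  200.0],
              [2.0,  400.0],
              [3.0,  600.0],
              [4.0,  800.0],
              [5.0, 1000.0]])
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max rescale each feature column to [-1, 1].
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_scaled = 2 * (X - x_min) / (x_max - x_min) - 1

# y is left untouched; only the features are rescaled before fitting SVR.
print(X_scaled.min(axis=0))  # each column now spans [-1, 1]
print(X_scaled.max(axis=0))
```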

Question 2

In a real scenario you will have only the training set; the test set is unseen and may contain outliers, so how would you normalize on it? The correct method is therefore to split the data first, normalize the training set, and then use the same normalization parameters to normalize the test set.
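The split-then-normalize procedure can be sketched as follows (synthetic data for illustration): z-score parameters are estimated on the training set only and then reused, unchanged, on the test set:

```python
import numpy as np

# Synthetic data standing in for a real dataset.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# 1. Split first.
train, test = data[:80], data[80:]

# 2. Fit normalization parameters on the training set only.
mu, sigma = train.mean(axis=0), train.std(axis=0)

# 3. Apply the SAME parameters to both sets -- no refitting on test data.
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma
```

Note that `test_scaled` will generally not have exactly zero mean and unit variance; that is expected, since its parameters come from the training set.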

However, if you already know the normalization parameters, you can normalize directly, even if they were not estimated from the training dataset. For example, if you are using RGB values of 24-bit images, it is known that 0 is the minimum and 255 is the maximum, so you can apply min-max normalization directly without computing these bounds from the training or test data. Similarly, if you know the distribution, you can apply z-score normalization directly.
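For instance, with a priori bounds such as the $[0, 255]$ range of 24-bit RGB values, the min-max transform needs no data-derived statistics at all (a small sketch):

```python
import numpy as np

# 24-bit RGB channel values: the bounds [0, 255] are known a priori,
# so no min/max needs to be estimated from training or test data.
pixels = np.array([0.0, 64.0, 128.0, 255.0])

pixels_01 = pixels / 255.0              # rescaled to [0, 1]
pixels_pm1 = 2 * pixels / 255.0 - 1     # rescaled to [-1, 1]
```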