Solved – Is it essential to do normalization for SVM and Random Forest

machine-learning, normalization, random-forest, svm

Each dimension of my features has a different range of values. I want to know whether it is essential to normalize this dataset.

Best Answer

The answer to your question depends on the similarity/distance function you plan to use (for SVMs). If it is plain (unweighted) Euclidean distance, then by not normalizing your data you are unwittingly giving some features more importance than others.
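In practice, a common way to handle this for SVMs is to standardize the features, fitting the scaler on the training split only. A minimal sketch, assuming scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits StandardScaler on the training data only, then applies
# the same shift/scale to the test data before the SVM sees it.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```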

For example, if your first dimension ranges from 0-10 and your second from 0-1, a difference of 1 in the first dimension (just a tenth of its range) contributes as much to the distance computation as two wildly different values in the second dimension (0 and 1). Leaving the data unnormalized therefore exaggerates small differences in the first dimension. You could of course design a custom distance function or weight the dimensions by an expert's estimate, but that introduces many tunable parameters that grow with the dimensionality of your data. In this case, normalization is an easier path (although not necessarily ideal) because it at least lets you get started.
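A quick numeric check of the example above (plain numpy; min-max scaling by the known ranges is just one possible normalization choice):

```python
import numpy as np

# Two points that differ by 1 in dimension 1 (range 0-10) and by the
# full range in dimension 2 (range 0-1).
a = np.array([5.0, 0.0])
b = np.array([6.0, 1.0])

# Unnormalized: both dimensions contribute equally to the distance,
# even though the first differs by only a tenth of its range.
print(np.linalg.norm(a - b))  # sqrt(1 + 1) ~= 1.414

# Min-max scale each dimension by its range; now the second dimension's
# full-range difference dominates, as it should.
ranges = np.array([10.0, 1.0])
print(np.linalg.norm(a / ranges - b / ranges))  # sqrt(0.01 + 1) ~= 1.005
```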

Finally, still for SVMs, another option is to define a similarity function rather than a distance function and plug it in as a kernel (technically, this function must produce positive semi-definite Gram matrices). Such a function can be constructed any way you like and can take the disparity in feature ranges into account, as in the sketch below.
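For instance, scikit-learn's SVC accepts a callable kernel that returns the Gram matrix between two sets of samples. The range_aware_rbf below is a hypothetical example: an RBF kernel that divides each feature by an assumed known range before computing distances (rescaling the inputs keeps the resulting Gram matrices positive semi-definite):

```python
import numpy as np
from sklearn.svm import SVC

ranges = np.array([10.0, 1.0])  # assumed known per-feature ranges

def range_aware_rbf(X, Y):
    # Rescale each feature by its range, then apply a standard RBF kernel.
    Xs, Ys = X / ranges, Y / ranges
    sq_dists = ((Xs[:, None, :] - Ys[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists)

# Synthetic data on the 0-10 x 0-1 rectangle, labeled by the second feature.
rng = np.random.RandomState(0)
X = rng.uniform([0.0, 0.0], [10.0, 1.0], size=(100, 2))
y = (X[:, 1] > 0.5).astype(int)

clf = SVC(kernel=range_aware_rbf).fit(X, y)
print("training accuracy:", clf.score(X, y))
```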

For random forests, on the other hand, feature ranges don't matter, because one feature is never compared in magnitude to another: each split thresholds a single feature at a time, and rescaling a feature monotonically does not change which points fall on either side of the split.
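A quick sanity check of this claim, again assuming scikit-learn and synthetic data: fitting the same forest on raw and arbitrarily rescaled features should yield identical predictions, because positive rescaling preserves the ordering within each feature:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
scale = np.array([1.0, 10.0, 0.01, 1000.0, 0.5])  # arbitrary positive scales

rf_raw = RandomForestClassifier(random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(random_state=0).fit(X * scale, y)

# With the same random_state and unchanged within-feature orderings,
# the learned trees split the data identically.
print(np.array_equal(rf_raw.predict(X), rf_scaled.predict(X * scale)))  # True
```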
