Each dimension of my feature vectors has a different range of values. I want to know whether it is essential to normalize this dataset.

# Solved – Is it essential to do normalization for SVM and Random Forest

machine learning, normalization, random forest, svm

#### Related Solutions

Store the mean and standard deviation of the training dataset features. When the test data is received, normalize each feature by subtracting its corresponding training mean and dividing by the corresponding training standard deviation.
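A minimal sketch of this procedure, using only the Python standard library. The helper names `fit_standardizer` and `standardize` and the toy height/weight data are made up for illustration; the sketch also assumes no feature is constant in the training set (otherwise the standard deviation is zero).

```python
from statistics import mean, stdev

def fit_standardizer(train_rows):
    """Compute per-feature mean and standard deviation from the training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    stds = [stdev(col) for col in columns]  # sample standard deviation
    return means, stds

def standardize(rows, means, stds):
    """Apply the stored training statistics to any rows (training or test)."""
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Toy data, invented for this example: [height_cm, weight_kg]
train = [[150.0, 50.0], [160.0, 60.0], [170.0, 70.0]]
test = [[180.0, 80.0]]

means, stds = fit_standardizer(train)            # statistics come from training data only
train_scaled = standardize(train, means, stds)
test_scaled = standardize(test, means, stds)     # test data reuses the training statistics
print(test_scaled)
```

The point of the design is that the test rows never influence `means` or `stds`; they are only transformed with them.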

Normalization by min/max is usually a bad idea, since it scales your entire dataset according to two particular observations (the minimum and the maximum), so the scaling can end up dominated by noise or outliers. Scaling by mean/std is the standard procedure, and you can also experiment with more robust measures (e.g. median/MAD).
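To make the contrast concrete, here is a small sketch comparing min/max scaling with median/MAD scaling on one feature that contains a single outlier. The function names and the data are invented for the example, and the sketch assumes the MAD is non-zero.

```python
from statistics import median

def min_max_scale(values):
    """Scale to [0, 1] using the two extreme observations."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def robust_scale(values):
    """Center by the median and divide by the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # assumed non-zero here
    return [(v - med) / mad for v in values]

# Five ordinary values plus one extreme outlier (made-up data).
with_outlier = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]

mm = min_max_scale(with_outlier)
rb = robust_scale(with_outlier)

# Under min/max the ordinary values are squeezed into a tiny interval near 0,
# while median/MAD keeps them spread over a range of a few units.
print("min/max spread of ordinary values:", mm[4] - mm[0])
print("median/MAD spread of ordinary values:", rb[4] - rb[0])
```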

Why scale/normalize? Because of the way the SVM optimization problem is defined, features with higher variance have a greater effect on the margin. Usually this doesn't make sense: we'd like our classifier to be 'unit invariant' (e.g. a classifier that combines patients' weight and height shouldn't be affected by the choice of units, whether kilograms or grams, centimeters or meters).

However, there may be cases in which all of the features are given in the same units and the differences in their variance genuinely reflect differences in importance. In such a case I'd try skipping scaling/normalization and see what it does to the performance.

The RBF kernel, commonly written as $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, relies on Euclidean distance. Do you have any reason to believe that Euclidean distance accurately captures the notion of distance in your data space? I doubt that.

There are several reasons Euclidean distance is not as good as we'd like it to be:

Different features may have different scales. If one feature is, say, the distance from one city to another in meters, and another is the height of an object in meters, then clearly the first would affect the distance much more than the latter.

Features may be correlated. Suppose an extreme case in which one feature is replicated several times (that is, it has a correlation of 1 with its copies): $(x, y, y, y, y) \in \mathbb{R}^5$. This is essentially an $\mathbb{R}^2$ space "embedded" into $\mathbb{R}^5$. According to the $\mathbb{R}^5$ distance, $(2, 0, 0, 0, 0)$ and $(0, 1, 1, 1, 1)$ are equally far from the origin $(0, 0, 0, 0, 0)$, both at distance 2. But in $\mathbb{R}^2$ the corresponding points $(2, 0)$ and $(0, 1)$ are at distances 2 and 1, so the second is clearly closer: the replication has inflated the contribution of $y$.
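The replicated-feature example can be checked numerically. This short sketch computes the distance of each point to the origin in both spaces, so the effect of the copies can be read off directly (the variable names are just for illustration).

```python
import math

origin5 = (0, 0, 0, 0, 0)
a5 = (2, 0, 0, 0, 0)  # (x, y) = (2, 0) with y replicated four times
b5 = (0, 1, 1, 1, 1)  # (x, y) = (0, 1) with y replicated four times

# Distances in the redundant 5-dimensional representation.
d_a5 = math.dist(a5, origin5)
d_b5 = math.dist(b5, origin5)

# Distances in the underlying 2-dimensional space.
d_a2 = math.dist((2, 0), (0, 0))
d_b2 = math.dist((0, 1), (0, 0))

print("R^5:", d_a5, d_b5)
print("R^2:", d_a2, d_b2)
```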

So how can you normalize your data to address these issues? The answer is whitening. Basically, you transform your data by a linear transformation $M$ so that the resulting covariance matrix is the identity matrix:

$$ I = \mathbb{E}[(M X) (M X)^T] = M \mathbb{E}[X X^T] M^T \Rightarrow M^{-1} M^{-T} = \mathbb{E}[X X^T] $$

The covariance matrix is symmetric, so we can look for an $M$ that is symmetric as well, which gives

$$ M^{-2} = \mathbb{E}[X X^T] \Rightarrow M = \mathbb{E}[X X^T]^{-1/2} $$

P.S. Of course, you'd like to center your data first.
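One common way to compute such an $M$ is through an eigendecomposition. The sketch below assumes the data are centered and the covariance is invertible; the symbols $\Sigma$, $U$ and $\Lambda$ are introduced here for illustration and are not part of the derivation above. Write $\Sigma = \mathbb{E}[X X^T]$ with orthogonal $U$ and diagonal $\Lambda$ of positive eigenvalues:

$$ \Sigma = U \Lambda U^T \Rightarrow M = \Sigma^{-1/2} = U \Lambda^{-1/2} U^T $$

Checking the result against the whitening condition:

$$ M \Sigma M^T = U \Lambda^{-1/2} U^T \, U \Lambda U^T \, U \Lambda^{-1/2} U^T = U \Lambda^{-1/2} \Lambda \Lambda^{-1/2} U^T = I $$

In the special case of uncorrelated features, $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$ and $M = \mathrm{diag}(1/\sigma_1, \dots, 1/\sigma_d)$, which is just division by each feature's standard deviation. If $\Sigma$ is singular, as with exactly replicated features, the inverse square root does not exist as written, and the redundant directions would have to be dropped or regularized first.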

## Best Answer

The answer to your question depends on what similarity/distance function you plan to use (in SVMs). If it's simple (unweighted) Euclidean distance, then if you don't normalize your data you are unwittingly giving some features more importance than others.

For example, if your first dimension ranges from 0 to 10 and your second from 0 to 1, a difference of 1 in the first dimension (just a tenth of its range) contributes as much to the distance computation as two wildly different values in the second dimension (0 and 1). So by leaving the data unnormalized, you're exaggerating small differences in the first dimension. You could of course come up with a custom distance function or weight your dimensions by an expert's estimate, but this introduces tunable parameters whose number grows with the dimensionality of your data. In this case, normalization is an easier path (although not necessarily ideal) because you can at least get started.
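The arithmetic in this example can be sketched as follows. The three points and the assumption that the feature ranges (10 and 1) are known in advance are hypothetical, chosen only to mirror the numbers in the paragraph above.

```python
import math

# Feature 1 ranges over 0 to 10, feature 2 over 0 to 1 (assumed known ranges).
ranges = (10.0, 1.0)

p = (4.0, 0.0)
q = (5.0, 0.0)  # differs from p by 1 in feature 1 (a tenth of its range)
r = (4.0, 1.0)  # differs from p by the full range of feature 2

def rescale(point):
    """Divide each coordinate by its feature's range."""
    return tuple(x / w for x, w in zip(point, ranges))

# Raw Euclidean distances: both differences count equally.
raw_pq = math.dist(p, q)
raw_pr = math.dist(p, r)

# After rescaling, the small relative change in feature 1 counts for less.
scaled_pq = math.dist(rescale(p), rescale(q))
scaled_pr = math.dist(rescale(p), rescale(r))

print("raw:   ", raw_pq, raw_pr)
print("scaled:", scaled_pq, scaled_pr)
```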

Finally, still for SVMs, another thing you can do is come up with a similarity function rather than a distance function and plug it in as a kernel (technically, this function must generate positive semi-definite kernel matrices). Subject to that constraint, the function can be constructed any way you like and can take into account the disparity in the ranges of the features.
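As one possible illustration, the sketch below defines an RBF-style similarity that divides each feature difference by a per-feature scale before exponentiating. This is equivalent to an ordinary RBF kernel on rescaled inputs, so it remains a valid kernel. The names `scaled_rbf_kernel` and `gram_matrix`, the scales and the points are all assumptions made for the example.

```python
import math

def scaled_rbf_kernel(x, y, scales, gamma=1.0):
    """RBF similarity computed on feature differences divided by per-feature scales."""
    sq_dist = sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, scales))
    return math.exp(-gamma * sq_dist)

def gram_matrix(points, scales, gamma=1.0):
    """Pairwise kernel values for a list of points."""
    return [
        [scaled_rbf_kernel(p, q, scales, gamma) for q in points]
        for p in points
    ]

# Hypothetical data: feature 1 spans roughly 0 to 10, feature 2 spans 0 to 1.
points = [(4.0, 0.0), (5.0, 0.0), (4.0, 1.0)]
scales = (10.0, 1.0)

K = gram_matrix(points, scales)
for row in K:
    print([round(v, 3) for v in row])
```

With these scales, the first two points (which differ by a tenth of feature 1's range) come out as far more similar than the first and third (which differ by the whole range of feature 2).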

For random forests, on the other hand, the ranges don't matter, since one feature is never compared in magnitude to other features. Each split considers the range of a single feature only.
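This can be illustrated with a simplified single-feature split search in the style of a decision tree. It is a stand-in written for this example, not any library's implementation, and it assumes the feature has at least two distinct values. Rescaling the feature changes the threshold but not which samples go to each side.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Find the threshold on one feature that minimizes weighted Gini impurity.

    Returns the threshold and the set of sample indices sent to the left child.
    """
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2  # midpoint between consecutive distinct values
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        score = (
            len(left) * gini([labels[i] for i in left])
            + len(right) * gini([labels[i] for i in right])
        ) / len(values)
        if best is None or score < best[0]:
            best = (score, threshold, frozenset(left))
    _, threshold, left_indices = best
    return threshold, left_indices

# Made-up feature values and labels.
values = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]
labels = [0, 0, 0, 1, 1, 1]
rescaled = [1000.0 * v for v in values]  # same feature expressed in different units

t1, left1 = best_split(values, labels)
t2, left2 = best_split(rescaled, labels)

print("original threshold:", t1, "left indices:", sorted(left1))
print("rescaled threshold:", t2, "left indices:", sorted(left2))
```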