Solved – Data normalization for RBF kernel

kernel-trick, normalization, svm

I have a matrix of values where rows are individuals and columns are attributes. I want to extract a similarity value for every pair of individuals, and I use an rbf kernel:
$$k(x_i,x_j) = \exp\left(-\gamma\|x_i-x_j\|^2\right),$$
where $\gamma = \frac{1}{2\sigma^2}$.
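
For concreteness, here is a minimal sketch (using NumPy; the function name and toy data are just illustrative) of computing this pairwise similarity matrix:

```python
import numpy as np

def rbf_similarity(X, gamma):
    """Pairwise RBF similarities k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from round-off
    return np.exp(-gamma * sq_dists)

X = np.random.rand(6, 3)          # toy data: 6 individuals, 3 attributes
K = rbf_similarity(X, gamma=0.5)  # K is symmetric with ones on the diagonal
```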

Since each attribute has its own range, I suppose a normalization step is necessary to get a sound similarity value.

I divided each value in column (attribute) $i$ by the norm of column $i$, but now the RBF kernel outputs values very close to $1$, and I have to use a very "high" $\gamma$ value ($\approx 500$) to spread the similarity values out between $0$ (not similar) and $1$ (similar).
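
In code, that column-wise normalization looks roughly like this (NumPy, continuing the sketch above):

```python
# Divide each attribute (column) by its column norm, as described.
col_norms = np.linalg.norm(X, axis=0, keepdims=True)
X_colnorm = X / col_norms
# Each column now has unit norm, so with n rows the entries are on the order
# of 1/sqrt(n); pairwise squared distances become tiny, which is why the
# kernel values cluster near 1 unless gamma is made very large.
K_colnorm = rbf_similarity(X_colnorm, gamma=500.0)
```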

Is this kind of data normalization "sound" for RBF kernel?

Should I normalize the rows (individuals) rather than the columns (attributes)?

Best Answer

As you can see from the formula, the RBF kernel is built on the Euclidean distance. Do you have any reason to believe that the Euclidean distance accurately captures the notion of distance in your data space? I doubt it.

There are several reasons Euclidean distance is not as good as we'd like it to be:

  1. Different features may have different scales. If one feature is, say, the distance from one city to another in meters, and another is the height of an object in meters, then clearly the first one will affect the distance much more than the latter.

  2. Features may be correlated. Consider an extreme case where one feature is replicated several times (so the copies have correlation $1$ with it): $(x, y, y, y, y) \in \mathbb{R}^5$. This is essentially an $\mathbb{R}^2$ space "embedded" into $\mathbb{R}^5$. Under the $\mathbb{R}^5$ distance, $(2, 0, 0, 0, 0)$ and $(0, 1, 1, 1, 1)$ are equally far from the origin $(0, 0, 0, 0, 0)$, yet in the underlying $\mathbb{R}^2$ the corresponding points are $(2, 0)$ and $(0, 1)$, and the second one is twice as close (see the quick check after this list).
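
A quick numeric check of this example (a minimal sketch with NumPy):

```python
import numpy as np

p = np.array([2.0, 0.0, 0.0, 0.0, 0.0])  # underlying (x, y) = (2, 0)
q = np.array([0.0, 1.0, 1.0, 1.0, 1.0])  # underlying (x, y) = (0, 1), y copied 4 times

print(np.linalg.norm(p), np.linalg.norm(q))                    # R^5: 2.0 and 2.0 -- equally far
print(np.linalg.norm([2.0, 0.0]), np.linalg.norm([0.0, 1.0]))  # R^2: 2.0 and 1.0 -- q is closer
```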

So how can you normalize your data to address these issues? The answer is whitening. Basically, you transform your data by a linear transformation $M$ so that the resulting covariance matrix is the identity:

$$ I = \mathbb{E}[(M X) (M X)^T] = M \mathbb{E}[X X^T] M^T \Rightarrow M^{-1} M^{-T} = \mathbb{E}[X X^T] $$

The covariance matrix is symmetric, so we can choose $M$ to be symmetric as well, which gives

$$ M^{-2} = \mathbb{E}[X X^T] \Rightarrow M = \mathbb{E}[X X^T]^{-1/2} $$

P.S. Of course, you'd like to center your data first.
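
A minimal sketch of such a whitening step (NumPy; rows are individuals as in the question, the inverse square root is taken via an eigendecomposition, and the small `eps` guarding against near-zero eigenvalues is an illustrative choice):

```python
import numpy as np

def whiten(X, eps=1e-10):
    Xc = X - X.mean(axis=0)                 # center the data first
    cov = np.cov(Xc, rowvar=False)          # attribute-by-attribute covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # covariance is symmetric, so eigh applies
    M = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T  # cov^{-1/2}
    return Xc @ M                           # whitened data: covariance ~ identity

X = np.random.rand(100, 5)
Xw = whiten(X)
print(np.allclose(np.cov(Xw, rowvar=False), np.eye(5), atol=1e-6))  # True
```

Feeding the whitened data into the RBF kernel is equivalent to replacing the Euclidean distance in the kernel by the Mahalanobis distance on the original data, which addresses both issues above.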