Solved – data normalization after dimension reduction for classification

classification, data preprocessing, dimensionality reduction, mathematical-statistics, normalization

The classifier is KNN or an RBF-SVM. After dimensionality reduction (e.g., PCA, LDA, KPCA, or KLDA), is it still necessary to normalize the data before classification?

In the LIBSVM package, the recommended workflow is to first run svm-scale, which applies min-max normalization to the features, and then pass the scaled features to svm-train.
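For reference, the effect of svm-scale can be reproduced with scikit-learn's MinMaxScaler; this is a sketch with a made-up feature matrix, assuming svm-scale's default target range of [-1, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix; the second column is on a much larger scale.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# svm-scale's default is to map each feature into [-1, 1].
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X)

print(X_scaled.min(axis=0))  # each column now spans down to -1
print(X_scaled.max(axis=0))  # and up to +1
```

As with svm-scale, the scaling parameters fitted on the training set should be reused on the test set (`scaler.transform`), not re-fitted.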

I'm not sure whether data normalization would harm the structure of the features transformed by PCA, LDA, etc.

Best Answer

PCA does require normalization as a pre-processing step.

Normalization is important in PCA because PCA is a variance-maximizing exercise: it projects the original data onto the directions that maximize the variance, so features on larger scales dominate unless they are normalized first. Source: here
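To make the point concrete, here is a small sketch (with synthetic, made-up data) showing that without standardization the first principal component simply follows the feature with the largest scale, while after standardization both features contribute:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two strongly correlated features on very different scales.
x1 = rng.normal(0.0, 1.0, 500)
x2 = 100.0 * x1 + rng.normal(0.0, 10.0, 500)
X = np.column_stack([x1, x2])

pca_raw = PCA(n_components=1).fit(X)
pca_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X))

# Without scaling, the first PC is almost entirely the large-scale feature.
print(np.abs(pca_raw.components_[0]))  # ~[0.01, 1.0]
# After standardization, both features load equally.
print(np.abs(pca_std.components_[0]))  # ~[0.71, 0.71]
```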

Would a further step of data normalization harm the data?

No, it would not harm the data. But would it be really necessary?

import numpy as np
from sklearn.decomposition import PCA

# Two correlated features with very different variances.
mean = [0.0, 20.0]
cov = [[1.0, 0.7], [0.7, 1000.0]]
values = np.random.multivariate_normal(mean, cov, 1000)

# whiten=True rescales each principal component to unit variance.
pca = PCA(n_components=1, whiten=True)
pca.fit(values)

values_ = pca.transform(values)
print(np.var(values_))

This snippet prints a value of (approximately) 1.0.

Why? We are projecting two whitened features onto the first principal component. Let's assume that a point in the whitened space is identified by a vector $a$. The new vector $a'$ is the result of the transformation $$a' = |a| \cos(\theta) = a \cdot \hat{b},$$

where $|a|$ is the length of $a$ and $\theta$ is the angle between $a$ and the unit vector $\hat{b}$ we are projecting onto. In this case $\hat{b}$ equals $e$, the eigenvector that maps each row vector onto the principal component.

What is the variance of the whitened feature once projected on the principal component?

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (a_i \cdot e)^2 = e^T \frac{a^T a}{n} e$$

$e^T e = 1$ by definition (eigenvectors are unit vectors), and whitening the data imposed zero means on the feature set. Moreover, whitened data has identity covariance, so $\frac{a^T a}{n} = I$ and therefore $\sigma^2 = e^T I\, e = e^T e = 1$.
