PCA does require normalization as a pre-processing step. Normalization matters because PCA is a variance-maximizing exercise: it projects your original data onto the directions that maximize the variance.
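To make that concrete, here is a minimal sketch (mine, not from the original answer, with illustrative covariance values): without standardization, the first principal component simply chases whichever feature happens to have the largest variance.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], 1000)
X[:, 1] *= 100.0  # blow up the variance of the second feature

pca_raw = PCA(n_components=1).fit(X)
pca_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X))

print(pca_raw.components_)  # ~[0, 1] up to sign: the high-variance feature dominates
print(pca_std.components_)  # ~[0.71, 0.71] up to sign: both features contribute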
Would a further step of data normalization harm the data? No, it would not. But is it really necessary?
import numpy as np
from sklearn.decomposition import PCA

# Two features: one with variance 1, one with variance 1000.
mean = [0.0, 20.0]
cov = [[1.0, 0.7], [0.7, 1000.0]]
values = np.random.multivariate_normal(mean, cov, 1000)

# Project onto the first principal component, with whitening.
pca = PCA(n_components=1, whiten=True)
pca.fit(values)
values_ = pca.transform(values)

print(np.var(values_))
This snippet prints 1.0. Why? We are projecting the two whitened features onto the first principal component.
Let's assume that a point in the whitened space is identified by a vector $a$. Its new value $a'$ is the result of the projection
$$a' = |a| \cos(\theta) = a \cdot \hat{b}$$
where $|a|$ is the length of $a$ and $\theta$ is the angle between $a$ and the unit vector $\hat{b}$ we are projecting onto. In this case $\hat{b}$ equals $e$, the eigenvector that maps each row vector onto the principal component.
What is the variance of the whitened features once projected onto the principal component?
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (a_i \cdot e)^2 = e^T \frac{A^T A}{n} e$$
where $A$ is the matrix whose rows are the whitened vectors $a_i$. Whitening imposes zero mean and identity covariance on the feature set, so $\frac{A^T A}{n} = I$. Since $e^T e = 1$ by definition (eigenvectors are unit vectors), it follows that $\sigma^2 = e^T I e = e^T e = 1$.
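Here is a quick numerical check of this derivation (a sketch, reusing the illustrative covariance values from the snippet above): the whitened data has identity covariance, so its projection onto any unit vector has variance 1.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
values = rng.multivariate_normal([0.0, 20.0], [[1.0, 0.7], [0.7, 1000.0]], 100000)

# Whiten both components, then project onto an arbitrary unit vector e.
whitened = PCA(n_components=2, whiten=True).fit_transform(values)
print(np.cov(whitened, rowvar=False))  # ~identity matrix

e = np.array([0.6, 0.8])  # any unit vector: e.T @ e == 1
print(np.var(whitened @ e))  # ~1.0, as derived above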
Why are you applying normalisation? Is it because you believe it to be a necessary step, or because you have determined, based on your data, that it is appropriate? Mean centring and scaling to unit variance are commonly useful, but not universally so, and you should think about the properties of your data.
Mean centring is rarely harmful, but it may be less useful for highly skewed populations, where subtracting the mean does not account for a large proportion of the variance in the dataset. Median or mode centring are less common solutions, but they may work better when a long tail drags the mean far from the bulk of the data.
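An illustrative sketch of that point (mine, assuming a lognormal sample as a stand-in for a "highly skewed population"): the mean is dragged into the tail, and centring on the median leaves a smaller total absolute deviation.

import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)  # strongly right-skewed

print(np.mean(x), np.median(x))           # the mean sits far into the tail
print(np.mean(np.abs(x - np.mean(x))))    # total absolute deviation after mean centring
print(np.mean(np.abs(x - np.median(x))))  # smaller after median centring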
Scaling to unit variance is less useful if the data are all on the same dynamic range and the noise is correlated with magnitude. In such scenarios, scaling to unit variance magnifies the apparent variance of a low-amplitude signal that should arguably remain lower than the variance of a high-amplitude signal.
I realise this link is specifically about PCA, but it discusses when unit scaling is and isn't helpful, and the lessons generalise. Note, to help interpret the link, that using variance-scaled data creates a correlation matrix in the first step of PCA, while non-scaled data creates a covariance matrix: PCA on correlation or covariance?
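To see that correlation-vs-covariance distinction numerically, here is a small sketch (my own, with illustrative covariance values): the eigenvalues of the covariance matrix are dominated by the large-variance feature, while those of the correlation matrix are not.

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1000.0]], 5000)

print(np.linalg.eigvalsh(np.cov(X, rowvar=False)))       # dominated by the large-variance feature
print(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))  # both features contribute comparably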
There isn't a hard-and-fast rule about which is better; this is context-dependent. For example, people training auto-encoders for MNIST commonly use $[0,1]$ scaling and use a variant of the log-loss; you can't use the log-loss variant in conjunction with $z$ scaling because taking the log of a negative number doesn't yield a real number. On the other hand, different problems might favor different scaling schemes for similarly idiosyncratic reasons.
Scaling is important because it preconditions the data to facilitate optimization. Putting the features on the same scale stretches the optimization surface to ameliorate narrow valleys, which make optimization very challenging, especially gradient-based optimization. A choice of scaling is "correct" to the extent that it makes optimization go more smoothly. A scaling method that produces values on both sides of zero, such as $z$ scaling or $[-1,1]$ scaling, is preferred (unless you're in a setting similar to using BCE loss for an auto-encoder). From the Neural Network FAQ:
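As a concrete sketch of those two zero-centred options (the function names here are mine and purely illustrative):

import numpy as np

def z_scale(x):
    # zero mean, unit variance per feature
    return (x - x.mean(axis=0)) / x.std(axis=0)

def symmetric_minmax_scale(x):
    # affine map of [min, max] onto [-1, 1] per feature
    x01 = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
    return 2.0 * x01 - 1.0

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=5.0, size=(1000, 3))
print(z_scale(X).mean(axis=0), z_scale(X).std(axis=0))  # ~0 and ~1
print(symmetric_minmax_scale(X).min(axis=0), symmetric_minmax_scale(X).max(axis=0))  # -1 and 1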
A second benefit of scaling is that it can prevent units from saturating early in training. Sigmoid, tanh and softmax functions have horizontal asymptotes, so very large and very small inputs have small gradients. If training starts with these units at saturation, then optimization will proceed more slowly because the gradients are so shallow. (Effect of rescaling of inputs on loss for a simple neural network)
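A quick illustration of that saturation effect (a sketch; the input values are arbitrary): the sigmoid's gradient collapses for inputs far from zero, which is exactly what unscaled features tend to produce.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 10.0, 50.0]:
    s = sigmoid(x)
    print(x, s * (1.0 - s))  # sigmoid's derivative: ~0.25 at 0, ~2e-22 at 50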
Which scaling method works best depends on the problem, because different problems have different optimization surfaces. A very general strategy is to carry out an experiment and test how well the model works with alternative methods. This can be expensive, though, since the scaling interacts with other model configuration choices, such as the learning rate; in effect you would be testing all model configurations for all scaling choices. That is tedious, so it's typical to pick a simple method that works "well enough" for the problem and focus on more interesting considerations.
Scaling using the min and max can be extremely sensitive to outliers: if even one value is orders of magnitude larger or smaller than the rest of the data, the denominator becomes very large, and scaling clumps the rest of the data into a narrow segment of the $[0,1]$ or $[-1,1]$ interval, so the range used by most of the data is much narrower.
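A short sketch of this clumping effect (illustrative values): a single extreme value pushes the rest of the min-max-scaled data into a sliver of $[0,1]$.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
x[0] = 1e6  # one extreme outlier

scaled = (x - x.min()) / (x.max() - x.min())  # [0, 1] scaling
print(np.percentile(scaled, 99))  # the bulk of the data sits in a sliver near 0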
A single large outlier will strongly influence the denominator even for $z$ scaling, but the larger the sample size, the smaller that influence becomes. Methods using the max and min, on the other hand, will always be strongly influenced by a single outlier. And as the FAQ quotation notes, robust estimators will be more effective; unbiasedness isn't really a concern for this application.
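As a sketch of what a robust alternative might look like (centre on the median, scale by the interquartile range; sklearn's RobustScaler does essentially this), reusing the outlier example above:

import numpy as np

def robust_scale(x):
    # centre on the median, scale by the interquartile range
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    return (x - np.median(x)) / iqr

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
x[0] = 1e6  # the same single outlier

print(np.std(robust_scale(x)[1:]))             # bulk keeps a sensible spread (~0.74)
print(np.std(((x - x.mean()) / x.std())[1:]))  # z-scaled bulk is crushed toward zero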