The open-source project vowpal wabbit includes an implementation of online SGD that is enhanced by on-the-fly (online) computation of three additional factors affecting the weight updates. These factors can be enabled/disabled by their respective command-line options (by default all three are turned on; the --sgd option turns them all off, i.e. falls back on "classic" SGD).
The three SGD-enhancing options are:
--normalized
updates adjusted for scale of each feature
--adaptive
uses adaptive gradient (AdaGrad) (Duchi, Hazan, Singer)
--invariant
importance aware updates (Karampatziakis, Langford)
Together, they ensure that the online learning process does a three-way automatic compensation/adjustment for:
- per-feature scaling (large vs small values)
- per-feature learning-rate decay based on feature importance
- per-feature adaptive learning-rate adjustment for feature prevalence/rarity in examples
The upshot is that there's no need to pre-normalize or scale the different features by hand in order to get a less biased and more effective learner.
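To make the ideas behind --normalized and --adaptive more concrete, here is a minimal Python sketch of an online update that rescales each feature by the largest magnitude seen so far and applies an AdaGrad-style per-feature learning rate. It only illustrates the concepts, not vowpal wabbit's actual update rule; the function name, the squared-loss model, and the constants are assumptions made for this example.
import numpy as np

def adaptive_normalized_sgd_step(w, x, y, g_sq, s, eta=0.5):
    # Per-feature scaling: track the largest magnitude seen so far for each
    # feature and divide by it (a stand-in for the --normalized idea).
    s = np.maximum(s, np.abs(x))
    x_scaled = np.divide(x, s, out=np.zeros_like(x), where=s > 0)
    # Squared-loss gradient for a linear model on the scaled features.
    grad = (w @ x_scaled - y) * x_scaled
    # Per-feature adaptive learning rate: accumulate squared gradients and
    # shrink the step size of heavily updated features (AdaGrad, --adaptive idea).
    g_sq = g_sq + grad ** 2
    w = w - eta * grad / (np.sqrt(g_sq) + 1e-8)
    return w, g_sq, s

# Usage sketch: three features on wildly different scales, no manual pre-scaling.
rng = np.random.default_rng(0)
true_w = np.array([2.0, 0.03, 50.0])
w, g_sq, s = np.zeros(3), np.zeros(3), np.zeros(3)
for _ in range(5000):
    x = rng.normal(size=3) * np.array([1.0, 100.0, 0.01])
    w, g_sq, s = adaptive_normalized_sgd_step(w, x, x @ true_w, g_sq, s)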
In addition, vowpal wabbit implements online regularization via truncated gradient, with the regularization options below (a simplified sketch of the truncation idea follows the list):
--l1
(L1-norm)
--l2
(L2-norm)
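Roughly speaking, online L1 regularization of this kind can be pictured as soft-thresholding the weights toward zero after each gradient step. The sketch below shows only that idea; it is not vowpal wabbit's code, and the function name and constants are made up for illustration.
import numpy as np

def truncate(w, eta, l1):
    # Soft-threshold: shrink each weight toward zero by eta * l1 and clip at
    # zero, so small weights are driven to exactly zero (sparsity).
    return np.sign(w) * np.maximum(np.abs(w) - eta * l1, 0.0)

# Usage sketch: tiny weights are zeroed out, large ones shrink only slightly.
w = np.array([0.8, -0.002, 0.0005, -1.3])
print(truncate(w, eta=0.5, l1=0.01))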
My experience with these enhancements on multiple data sets was that each of them, when introduced into the code, significantly improved model accuracy and made convergence smoother.
PCA does require normalization as a pre-processing step.
Normalization is important in PCA since it is a variance maximizing
exercise. It projects your original data onto directions which
maximize the variance.
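A quick way to see why scale matters: when one feature has a much larger variance than the others, the first principal component of the raw data points almost entirely along that feature, while standardizing the features first restores a balanced direction. The sketch below uses made-up scales and a correlation of 0.7 purely for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two correlated features; the second is then put on a 100x larger scale.
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=1000)
X[:, 1] *= 100.0

# Without normalization, the first component points almost entirely along feature 2.
print(PCA(n_components=1).fit(X).components_)

# After standardizing, both features contribute roughly equally (about [0.71, 0.71]).
print(PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_)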
Would a further step of data normalization harm the data?
No, it would not harm the data. But would it really be necessary?
import numpy as np
from sklearn.decomposition import PCA

# Draw 1000 samples from a 2-D Gaussian whose features have very different variances.
mean = [0.0, 20.0]
cov = [[1.0, 0.7], [0.7, 1000]]
values = np.random.multivariate_normal(mean, cov, 1000)

# Project onto the first principal component, with whitening enabled.
pca = PCA(n_components=1, whiten=True)
pca.fit(values)
values_ = pca.transform(values)

# Variance of the whitened projection.
print(np.var(values_))
The snippet above prints a value very close to 1.0.
Why? We are projecting two whitened features onto the first component.
Let's assume that a point in the whitened space is identified by a vector ($a$)
The new vector ($a'$) is the result of the transformation
$$a' = |a| \cos(\theta) = a \cdot \hat{b}$$
where $|a|$ is the length of $a$, and $\theta$ is the angle between the vector $a$ and the vector we are projecting onto. In this case $b$ equals $e$, the eigenvector that maps each row vector onto the principal component.
What is the variance of the whitened feature once projected on the principal component?
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (a_i \cdot e)^2 = e^T \frac{a^T a}{n} e$$
Note that when we whitened the data, we centered it, so the features have zero mean and the sample covariance is simply $\frac{a^T a}{n}$; whitening also makes that covariance the identity matrix, so $\sigma^2 = e^T e$. Finally, $e^T e = 1$ by definition (eigenvectors are unit vectors), which is why the projected variance is 1.
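As a quick numerical check of this argument (a sketch only; here the whitening is done by hand rather than with PCA(whiten=True)): after whitening, the sample covariance of the data is the identity, so projecting onto the unit-length leading eigenvector gives variance of about 1.
import numpy as np

rng = np.random.default_rng(0)
A = rng.multivariate_normal([0.0, 20.0], [[1.0, 0.7], [0.7, 1000.0]], size=1000)

# Whiten by hand: center, decorrelate, and rescale each direction to unit variance.
A = A - A.mean(axis=0)
vals, vecs = np.linalg.eigh(A.T @ A / len(A))
W = A @ vecs / np.sqrt(vals)

# Leading eigenvector of the whitened covariance (a unit vector).
e = np.linalg.eigh(W.T @ W / len(W))[1][:, -1]
print(np.var(W @ e))  # close to 1.0, i.e. e^T (a^T a / n) e = e^T e = 1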
Best Answer
In an ideal world, our training data should be representative of the production data, which means that the descriptive statistics (such as the mean, max, or min) should not change too much. Thus, in an "online-learning" environment, we should be able to use the max and min value from the historical training data to do the normalization.
If the training data is not representative of the production data, or we do not know how the production data is distributed, the answer is: 1) collect data; 2) train offline; and 3) then put the model into production.
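To make the first point concrete, a scaler like the one below might be fit once on the historical training data and then applied unchanged to incoming production examples. This is only a sketch; the class name, method names, and clipping behaviour are assumptions made for this example.
import numpy as np

class HistoricalMinMaxScaler:
    # Min-max scaling that uses statistics from the historical training data only.

    def fit(self, X_train):
        self.min_ = X_train.min(axis=0)
        self.max_ = X_train.max(axis=0)
        return self

    def transform(self, x):
        # Scale an incoming example with the training-time min/max; clip in
        # case production values drift outside the historical range.
        span = np.where(self.max_ > self.min_, self.max_ - self.min_, 1.0)
        return np.clip((x - self.min_) / span, 0.0, 1.0)

# Usage sketch: fit on historical data, then transform production examples one at a time.
rng = np.random.default_rng(0)
scaler = HistoricalMinMaxScaler().fit(rng.uniform(0.0, 100.0, size=(500, 3)))
print(scaler.transform(rng.uniform(0.0, 120.0, size=3)))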