The open-source project vowpal wabbit includes an implementation of online SGD that is enhanced by on-the-fly (online) computation of three additional factors affecting the weight updates. These factors can be enabled/disabled by their respective command-line options (by default all three are turned on; the --sgd option turns them all off, i.e. falls back on "classic" SGD).
The three SGD-enhancing options are:
--normalized
updates adjusted for scale of each feature
--adaptive
uses adaptive gradient (AdaGrad) (Duchi, Hazan, Singer)
--invariant
importance aware updates (Karampatziakis, Langford)
Together, they ensure that the online learning process does a three-way automatic compensation/adjustment for:
- per-feature scaling (large vs small values)
- per-feature learning-rate decay based on feature importance
- per-feature adaptive learning-rate adjustment for feature prevalence/rarity in examples
The upshot is that there's no need to pre-normalize or scale the different features by hand in order to get a less biased and more effective learner.
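To make the ideas behind --normalized and --adaptive more concrete, here is a minimal Python sketch of an online update that rescales each feature by the largest magnitude seen so far and applies an AdaGrad-style per-feature learning rate. It only illustrates the concepts, not vowpal wabbit's actual update rule; the function name, the squared-loss model, and the constants are assumptions made for this example.
import numpy as np

def adaptive_normalized_sgd_step(w, x, y, g_sq, s, eta=0.5):
    # Per-feature scaling: track the largest magnitude seen so far for each
    # feature and divide by it (a stand-in for the --normalized idea).
    s = np.maximum(s, np.abs(x))
    x_scaled = np.divide(x, s, out=np.zeros_like(x), where=s > 0)
    # Squared-loss gradient for a linear model on the scaled features.
    grad = (w @ x_scaled - y) * x_scaled
    # Per-feature adaptive learning rate: accumulate squared gradients and
    # shrink the step size of heavily updated features (AdaGrad, --adaptive idea).
    g_sq = g_sq + grad ** 2
    w = w - eta * grad / (np.sqrt(g_sq) + 1e-8)
    return w, g_sq, s

# Usage sketch: three features on wildly different scales, no manual pre-scaling.
rng = np.random.default_rng(0)
true_w = np.array([2.0, 0.03, 50.0])
w, g_sq, s = np.zeros(3), np.zeros(3), np.zeros(3)
for _ in range(5000):
    x = rng.normal(size=3) * np.array([1.0, 100.0, 0.01])
    w, g_sq, s = adaptive_normalized_sgd_step(w, x, x @ true_w, g_sq, s)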
In addition, vowpal wabbit implements online regularization via truncated gradient, with the regularization options below (a simplified sketch of the truncation idea follows the list):
--l1
(L1-norm)
--l2
(L2-norm)
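Roughly speaking, online L1 regularization of this kind can be pictured as soft-thresholding the weights toward zero after each gradient step. The sketch below shows only that idea; it is not vowpal wabbit's code, and the function name and constants are made up for illustration.
import numpy as np

def truncate(w, eta, l1):
    # Soft-threshold: shrink each weight toward zero by eta * l1 and clip at
    # zero, so small weights are driven to exactly zero (sparsity).
    return np.sign(w) * np.maximum(np.abs(w) - eta * l1, 0.0)

# Usage sketch: tiny weights are zeroed out, large ones shrink only slightly.
w = np.array([0.8, -0.002, 0.0005, -1.3])
print(truncate(w, eta=0.5, l1=0.01))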
My experience with these enhancements on multiple data sets was that each of them, when introduced into the code, significantly improved model accuracy and made convergence smoother.
PCA does require normalization as a pre-processing step.
Normalization is important in PCA since it is a variance maximizing
exercise. It projects your original data onto directions which
maximize the variance.
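A quick way to see why scale matters: when one feature has a much larger variance than the others, the first principal component of the raw data points almost entirely along that feature, while standardizing the features first restores a balanced direction. The sketch below uses made-up scales and a correlation of 0.7 purely for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two correlated features; the second is then put on a 100x larger scale.
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=1000)
X[:, 1] *= 100.0

# Without normalization, the first component points almost entirely along feature 2.
print(PCA(n_components=1).fit(X).components_)

# After standardizing, both features contribute roughly equally (about [0.71, 0.71]).
print(PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_)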
Would a further step of data normalization harm the data?
No, it would not harm the data. But would it really be necessary?
import numpy as np
from sklearn.decomposition import PCA

# Draw 1000 samples from a 2-D Gaussian whose features have very different variances.
mean = [0.0, 20.0]
cov = [[1.0, 0.7], [0.7, 1000]]
values = np.random.multivariate_normal(mean, cov, 1000)

# Project onto the first principal component, with whitening enabled.
pca = PCA(n_components=1, whiten=True)
pca.fit(values)
values_ = pca.transform(values)

# Variance of the whitened projection.
print(np.var(values_))
The snippet above prints a value very close to 1.0.
Why? We are projecting two whitened features onto the first component.
Let's assume that a point in the whitened space is identified by a vector ($a$)
The new vector ($a'$) is the result of the transformation
$$a' = |a| \cos(\theta) = a \cdot \hat{b}$$
where $|a|$ is the length of $a$, and $\theta$ is the angle between the vector $a$ and the vector we are projecting onto. In this case $b$ equals $e$, the eigenvector that maps each row vector onto the principal component.
What is the variance of the whitened feature once projected on the principal component?
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (a_i \cdot e)^2 = e^T \frac{a^T a}{n} e$$
Note that when we whitened the data, we centered it, so the features have zero mean and the sample covariance is simply $\frac{a^T a}{n}$; whitening also makes that covariance the identity matrix, so $\sigma^2 = e^T e$. Finally, $e^T e = 1$ by definition (eigenvectors are unit vectors), which is why the projected variance is 1.
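As a quick numerical check of this argument (a sketch only; here the whitening is done by hand rather than with PCA(whiten=True)): after whitening, the sample covariance of the data is the identity, so projecting onto the unit-length leading eigenvector gives variance of about 1.
import numpy as np

rng = np.random.default_rng(0)
A = rng.multivariate_normal([0.0, 20.0], [[1.0, 0.7], [0.7, 1000.0]], size=1000)

# Whiten by hand: center, decorrelate, and rescale each direction to unit variance.
A = A - A.mean(axis=0)
vals, vecs = np.linalg.eigh(A.T @ A / len(A))
W = A @ vecs / np.sqrt(vals)

# Leading eigenvector of the whitened covariance (a unit vector).
e = np.linalg.eigh(W.T @ W / len(W))[1][:, -1]
print(np.var(W @ e))  # close to 1.0, i.e. e^T (a^T a / n) e = e^T e = 1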
Best Answer
In an ideal world, our training data should be representative of the production data, which means that the descriptive statistics (such as the mean, max, or min) should not change too much. Thus, in an "online-learning" environment, we should be able to use the max and min value from the historical training data to do the normalization.
If the training data is not representative of the production data, or we do not know how the production data is distributed, the answer is: 1) collect data; 2) train offline; and 3) then put the model into production.
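To make the first point concrete, a scaler like the one below might be fit once on the historical training data and then applied unchanged to incoming production examples. This is only a sketch; the class name, method names, and clipping behaviour are assumptions made for this example.
import numpy as np

class HistoricalMinMaxScaler:
    # Min-max scaling that uses statistics from the historical training data only.

    def fit(self, X_train):
        self.min_ = X_train.min(axis=0)
        self.max_ = X_train.max(axis=0)
        return self

    def transform(self, x):
        # Scale an incoming example with the training-time min/max; clip in
        # case production values drift outside the historical range.
        span = np.where(self.max_ > self.min_, self.max_ - self.min_, 1.0)
        return np.clip((x - self.min_) / span, 0.0, 1.0)

# Usage sketch: fit on historical data, then transform production examples one at a time.
rng = np.random.default_rng(0)
scaler = HistoricalMinMaxScaler().fit(rng.uniform(0.0, 100.0, size=(500, 3)))
print(scaler.transform(rng.uniform(0.0, 120.0, size=3)))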