Solved – Regularization and feature scaling in online learning

machine learning, normalization, online-algorithms, regularization

Let's say I have a logistic regression classifier. In normal batch learning, I'd have a regularizer term to prevent overfitting and keep my weights small. I'd also normalize and scale my features.

In an online learning setting, I'm getting a continuous stream of data. I do a gradient descent update with each example and then discard it. Am I supposed to use feature scaling and a regularization term in online learning? If yes, how can I do that? For example, I don't have a set of training data to scale against. I also don't have a validation set to tune my regularization parameter. If no, why not?

In my online learning setup, I get a stream of examples continuously. For each new example, I make a prediction. Then, in the next time step, I get the actual target and do the gradient descent update.
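For concreteness, the predict-then-update protocol described above can be sketched as follows. This is a minimal illustration with plain SGD and no scaling or regularization; the function names, the learning rate, and the toy stream are assumptions made for the example, not part of the question.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def online_logistic_sgd(stream, n_features, lr=0.1):
    """Predict on each example first, then update once its label is known."""
    w = [0.0] * n_features
    predictions = []
    for x, y in stream:  # y is 0 or 1
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        predictions.append(p)  # prediction made before the label is used
        # The gradient of the log loss with respect to w is (p - y) * x.
        for i, xi in enumerate(x):
            w[i] -= lr * (p - y) * xi
        # The example is discarded here; only w is carried forward.
    return w, predictions

# Toy stream (hypothetical data): the label is 1 when the first feature is positive.
stream = [([1.0, 1.0], 1), ([-1.0, 1.0], 0)] * 50
w, preds = online_logistic_sgd(stream, n_features=2)
```

Because every prediction is made before its label is seen, the recorded predictions are out-of-sample and can be compared against the targets as they arrive.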

Best Answer

The open-source project vowpal wabbit includes an implementation of online SGD that is enhanced by on-the-fly (online) computation of three additional factors affecting the weight updates. Each factor can be enabled or disabled with its own command-line option. By default all three are turned on; the --sgd option turns them all off, i.e. falls back on "classic" SGD.

The three SGD-enhancing options are:

  • --normalized: updates adjusted for the scale of each feature
  • --adaptive: uses an adaptive gradient (AdaGrad) (Duchi, Hazan, Singer)
  • --invariant: importance-aware updates (Karampatziakis, Langford)

Together, they ensure that the online learning process does a 3-way automatic compensation/adjustment for:

  • per-feature scaling (large vs small values)
  • per-feature learning rate decay based on feature importance
  • per-feature adaptive learning rate adjustment for feature prevalence/rarity in examples

The upshot is that there's no need to pre-normalize or scale different features to make the learner less biased and more effective.
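To give a feel for how two of these compensations can work without a stored training set, here is a simplified sketch that tracks a running per-feature scale and AdaGrad-style per-feature learning rates. It is not vowpal wabbit's actual implementation: the class name, the max-magnitude scaling rule, and the hyperparameters are assumptions chosen for illustration, and importance-aware updates are not covered.

```python
import math

class ScaledAdaptiveSGD:
    """Logistic regression with online per-feature scaling and AdaGrad-style rates."""

    def __init__(self, n_features, lr=0.5, eps=1e-8):
        self.w = [0.0] * n_features
        self.max_abs = [0.0] * n_features  # largest magnitude seen per feature
        self.g2 = [0.0] * n_features       # running sum of squared gradients
        self.lr = lr
        self.eps = eps

    def _scaled(self, x):
        # Divide each feature by the largest magnitude seen so far.
        return [xi / m if m > 0 else 0.0 for xi, m in zip(x, self.max_abs)]

    def predict(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, self._scaled(x)))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        # 1. Refresh the running scale with the new example.
        for i, xi in enumerate(x):
            self.max_abs[i] = max(self.max_abs[i], abs(xi))
        # 2. Log-loss gradient on the scaled features.
        p = self.predict(x)
        for i, xi in enumerate(self._scaled(x)):
            g = (p - y) * xi
            if g == 0.0:
                continue
            # 3. AdaGrad: each feature's step shrinks with its own gradient history,
            #    so rarely active features keep a larger learning rate.
            self.g2[i] += g * g
            self.w[i] -= self.lr * g / (math.sqrt(self.g2[i]) + self.eps)

# Hypothetical stream: two features carrying the same signal at very different scales.
model = ScaledAdaptiveSGD(n_features=2)
for _ in range(50):
    for s, y in ((1.0, 1), (-1.0, 0)):
        model.update([1000.0 * s, 0.001 * s], y)
```

In this toy stream both features end up with the same weight even though their raw values differ by six orders of magnitude. One simplification to be aware of: the sketch does not rescale existing weights when a feature's running maximum grows.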

In addition, vowpal wabbit also implements online regularization via truncated gradient descent with the regularization options:

  • --l1 (L1-norm)
  • --l2 (L2-norm)
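As a rough illustration of how regularization can be applied one example at a time, the sketch below performs an SGD step with an L2 penalty and then an L1 truncation that shrinks each weight toward zero without letting it cross zero. It is a simplified variant (truncation on every step, with no threshold on which weights get truncated) and is not taken from vowpal wabbit's code; the function names and parameters are assumptions for the example.

```python
def truncate(w, shrink):
    """Move w toward zero by `shrink`, stopping at zero instead of crossing it."""
    if w > 0:
        return max(0.0, w - shrink)
    if w < 0:
        return min(0.0, w + shrink)
    return 0.0

def regularized_step(w, grad, lr, l1=0.0, l2=0.0):
    """One SGD step with an L2 penalty, followed by L1 truncation."""
    out = []
    for wi, gi in zip(w, grad):
        # Gradient step on the loss plus the penalty (l2 / 2) * wi ** 2.
        wi = wi - lr * (gi + l2 * wi)
        # The L1 penalty is applied as a truncation, which yields exact zeros.
        wi = truncate(wi, lr * l1)
        out.append(wi)
    return out

# With a zero loss gradient, only the regularizers act on the weights.
sparse = regularized_step([0.05, -1.0], [0.0, 0.0], lr=0.1, l1=1.0)
decayed = regularized_step([1.0], [0.0], lr=0.1, l2=0.5)
```

Here the small weight 0.05 is set exactly to zero by the L1 truncation, while the L2 penalty only scales the weight 1.0 down to 0.95. No validation set is involved in the update itself; the regularization strengths still have to be chosen somehow.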

In my experience with these enhancements on multiple data sets, each of them significantly improved model accuracy and gave smoother convergence when it was introduced into the code.

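For more detail on these enhancements, see the academic papers by the authors cited above: Duchi, Hazan, and Singer on AdaGrad, and Karampatziakis and Langford on importance-aware updates.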
