Solved – Does regularization lead to getting stuck in local minima?

deep learning, neural networks, optimization, regularization

I frequently hear some very conflicting claims about deep learning algorithms, and I am currently a bit confused about the role of regularization. I have listed my queries about regularization below; they are all inter-related:

  1. We use regularization in deep neural nets to keep the generalization error small (since overfitting increases the generalization error), so regularization is basically used to control overfitting. Does overfitting mean that the optimization/training algorithm gets stuck in a local minimum?

  2. When we use weight decay, it penalizes large weights and drives them towards smaller values (the usual L2 penalty, written out after this list). When the weights become very small, doesn't that hinder the learning process or make the optimization algorithm get stuck in a local minimum? If the algorithm is not learning because of small weights, can we say it is stuck in a local minimum?

  3. Do regularization strategies make the optimization algorithm converge faster? If so, how?
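
For reference, this is the standard form of the L2 weight-decay objective I have in mind in question 2 (with $\lambda$ the regularization strength and $\eta$ the learning rate, just the usual notation):

$$
\tilde{L}(w) = L_{\text{data}}(w) + \frac{\lambda}{2}\,\lVert w \rVert_2^2
\qquad\Longrightarrow\qquad
w \leftarrow w - \eta\,\nabla L_{\text{data}}(w) - \eta\,\lambda\, w
$$

so every gradient step both follows the data loss and shrinks the weights by a factor proportional to $\lambda$.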

Best Answer

Your description of regularization is correct, but overfitting and getting stuck in a local minimum are not the same thing. You could get stuck in a local minimum and still generalize well.

A good way to think about the difference between overfitting and local minima is the following setup. Imagine a problem whose solution space has many local minima and no clear gradient direction towards the optimum. You set up three neural networks that are identical in their number of layers and parameters. The second network adds weight decay, and the third adds weight decay and also uses an evolution strategy to estimate and follow the gradients (rather than backpropagation with gradient descent). While both the second and third networks should generalize better than the first, the third might also be better able to escape local minima and reach a better solution. A rough sketch of the first two setups is given below.
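
Here is a minimal sketch of the first two setups, assuming PyTorch; the small MLP, the hyperparameters, and the helper `make_net` are purely illustrative, and the evolution-strategy variant is left out:

```python
# Sketch of "identical network" vs. "identical network + weight decay" (PyTorch assumed;
# architecture and hyperparameters are illustrative, not taken from the answer above).
import torch
import torch.nn as nn

def make_net():
    # identical architecture for every setup
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

net_plain = make_net()
net_decay = make_net()

# Setup 1: plain SGD, no regularization
opt_plain = torch.optim.SGD(net_plain.parameters(), lr=1e-2)

# Setup 2: same architecture, but an L2 penalty added through the optimizer's
# weight_decay argument; this is what shrinks the weights at every step
opt_decay = torch.optim.SGD(net_decay.parameters(), lr=1e-2, weight_decay=1e-4)

# One illustrative training step on random data
x, y = torch.randn(32, 20), torch.randn(32, 1)
for net, opt in [(net_plain, opt_plain), (net_decay, opt_decay)]:
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()
```

The only difference between the two setups is the `weight_decay` argument; everything else, including the initial architecture, is the same.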

Also, regularization strategies don't necessarily make convergence faster; it depends very much on the problem and the strategy.