Solved – L1-norm cost function for Neural Network (Regression)

deep learning, gradient descent, machine learning, neural networks, regression

I am trying to build a regression model using a neural network. The final cost measure is the mean absolute error (MAE) on the output (one output unit, 200 input units).
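Concretely, for $n$ samples the cost is
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left|\hat{y}_i - y_i\right|,$$
where $\hat{y}_i$ is the network's prediction and $y_i$ is the target.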

Right now all my hidden units use the rectifier (ReLU) activation, and the output unit is a linear unit with pass-through activation. The network does not seem to learn efficiently: even on the training data I cannot find a value (e.g. of the learning rate) that makes the error go down monotonically.

I suspect the cost function (L1-norm) might be the culprit. Right now, when taking the gradient, I pass either 1 or -1 depending on whether the predicted value is above or below the actual output value. Is this the right way? (Since L1 is not smooth at 0, could that be the reason why the learning is not smooth/effective?) What is the right way to handle an L1-norm cost function?
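For reference, here is a minimal sketch of what I am currently doing for the output gradient (NumPy, with illustrative shapes and names; my real network has hidden layers in between):

```python
import numpy as np

def mae_subgradient(y_pred, y_true):
    """Subgradient of the mean absolute error w.r.t. the predictions.

    |e| is not differentiable at e = 0; np.sign returns 0 there,
    which is still a valid subgradient choice.
    """
    return np.sign(y_pred - y_true) / y_pred.shape[0]

# Toy forward/backward pass through the linear output unit only.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 200))            # 32 samples, 200 inputs
y = rng.normal(size=(32, 1))              # regression targets
W = rng.normal(scale=1e-2, size=(200, 1)) # output-unit weights
b = np.zeros((1,))

y_pred = X @ W + b                        # linear (pass-through) output
grad_out = mae_subgradient(y_pred, y)     # +-1 (scaled) per sample
grad_W = X.T @ grad_out                   # backprop into the output weights
grad_b = grad_out.sum(axis=0)
```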

Thanks, any help is appreciated!

Best Answer

I am pretty sure it is not the L1 cost function. Neural nets are pretty robust to cost functions that are non-differentiable only at isolated points. To make really sure, you can use the L2 loss and see if it has similar problems. That being said, my experience is that L1- and L2-based objectives find similar solutions most of the time.
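A quick sanity check might look like the following sketch (PyTorch here, but any framework works; the architecture, sizes, and data are just placeholders): keep everything fixed and only swap the loss.

```python
import torch
import torch.nn as nn

def train(loss_fn, n_epochs=50, lr=1e-3, seed=0):
    torch.manual_seed(seed)
    X = torch.randn(1000, 200)   # dummy data with the same shape as the problem
    y = torch.randn(1000, 1)
    model = nn.Sequential(nn.Linear(200, 50), nn.ReLU(), nn.Linear(50, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

# Same setup, only the loss differs; if L2 also fails to decrease,
# the L1 cost is not the problem.
print("L1 final loss:", train(nn.L1Loss()))
print("L2 final loss:", train(nn.MSELoss()))
```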

Here are some things that you should investigate:

  • Try to initialize the parameters in different ranges (e.g. normally distributed around zero with a standard deviation of 1e-6, 1e-5, ..., 1e-1).
  • Reduce the learning rate, e.g. 1e-6, 1e-5, ..., 1e-1 (I suppose you are using SGD).
  • In case you are using something more sophisticated, especially a second-order method such as L-BFGS, switch to stochastic gradient descent, rprop, rmsprop, or adadelta instead. (These methods are not bad per se, but in some cases they work horribly, just as the other optimizers do in other cases.)
  • Have you tried mini-batches? That is, calculating the loss not on the whole data set or on a single sample, but on a group of samples (e.g. 10, 50, 200, depending on your data set size). This can help a lot, and the batch size is one of the most underestimated hyperparameters; see the sketch after this list.
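Here is a rough sketch of how those knobs fit together (PyTorch again; the architecture, batch size of 50, initialization scale of 1e-2, and learning-rate grid are only starting points to sweep over, not recommendations):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(5000, 200)                     # replace with your data
y = torch.randn(5000, 1)

# Mini-batches: the loss/gradient is computed on e.g. 50 samples at a time.
loader = DataLoader(TensorDataset(X, y), batch_size=50, shuffle=True)

for lr in [1e-5, 1e-4, 1e-3, 1e-2]:            # sweep the learning rate
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(200, 50), nn.ReLU(), nn.Linear(50, 1))
    # Small-scale weight initialization (std 1e-2 here; also worth sweeping).
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.normal_(p, std=1e-2)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for epoch in range(10):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    print(f"lr={lr:g}  final MAE={loss_fn(model(X), y).item():.4f}")
```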