Solved – RNN with L2 Regularization stops learning

deep learning, neural networks, recurrent neural network, regularization

I use a bidirectional RNN to detect an event with very unbalanced occurrence: the positive class is 100 times less frequent than the negative class. With no regularization I can get 100% accuracy on the training set and 30% on the validation set. When I turn on L2 regularization, the result is only 30% accuracy on the training set as well, instead of the slower learning and 100% validation accuracy I was hoping for.
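To make the setup concrete, here is a sketch of the kind of model I mean (just an illustration in Keras, not my exact code; the input feature dimension and the L2 strength are placeholder values):

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

n_timesteps, n_features = 80, 10   # 10 features per timestep is a placeholder
l2_strength = 1e-4                 # placeholder penalty strength

model = models.Sequential([
    layers.Input(shape=(n_timesteps, n_features)),
    layers.Bidirectional(
        layers.LSTM(
            128,
            kernel_regularizer=regularizers.l2(l2_strength),     # penalty on input weights
            recurrent_regularizer=regularizers.l2(l2_strength),  # penalty on recurrent weights
        )
    ),
    layers.Dense(1, activation="sigmoid"),   # binary event / no-event output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```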

I was thinking that maybe my data set is too small, so just as an experiment I merged the training set with the test set, which I had not used before. The situation was the same as if L2 regularization were on, even though I did not use it this time: I get 30% accuracy on train+test and on validation.

I use 128 hidden units and 80 timesteps in the experiments mentioned above. When I increased the number of hidden units to 256, I could again overfit the train+test set and reach 100% accuracy, but still only 30% on the validation set.

I have tried many hyperparameter options with almost no result. Maybe the weighted cross-entropy is causing the problem; in the experiments above the weight on the positive class is 5. With larger weights the results are often worse, around 20% accuracy.
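For clarity, by weighted cross-entropy I mean the standard positive-class weighting of binary cross-entropy, something along these lines (an illustration, not my exact code):

```python
import tensorflow as tf

pos_weight = 5.0   # weight on the rare positive class, as in the experiments above

def weighted_bce(labels, logits):
    # The positive term of the cross-entropy is multiplied by pos_weight, so missing
    # a rare positive example costs 5x more than missing a negative one.
    per_example = tf.nn.weighted_cross_entropy_with_logits(
        labels=labels, logits=logits, pos_weight=pos_weight)
    return tf.reduce_mean(per_example)
```

(With this loss the model would output raw logits rather than sigmoid probabilities.)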

I tried LSTM and GRU cells; it made no difference.

The best result I got: two hidden layers with 256 hidden units each. It took around 3 days of computation and 8 GB of GPU memory, and I got around 40-50% accuracy before the model started overfitting again, with L2 regularization on but not very strong.

I use the Adam optimizer; others did not work as well. The features I have are sufficient, because with a hand-built state machine I can get 90% accuracy. In that state machine, the main feature is a running sum that is thresholded based on other feature properties, and its variable length, sometimes 10 and sometimes 20 timesteps, carries the information about the event.
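Very roughly, the state machine does something like the following (hypothetical numbers and threshold, just to illustrate the sum-and-threshold idea):

```python
import numpy as np

def detect_events(feature, threshold=1.0):
    """Accumulate the feature and fire an event once the running sum crosses a
    threshold; depending on the signal this takes roughly 10-20 timesteps."""
    events = np.zeros(len(feature), dtype=bool)
    running = 0.0
    for t, x in enumerate(feature):
        running += x
        if running >= threshold:
            events[t] = True   # event detected at timestep t
            running = 0.0      # reset and wait for the next event
    return events
```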

Is there some general guideline for what to do in this situation? I was not able to find anything.

Best Answer

The paper "On the difficulty of training recurrent neural networks" (Pascanu, Mikolov, and Bengio) gives a hint as to why L2 regularization might kill RNN performance. Essentially, L1/L2 regularizing the recurrent weights also compromises the cells' ability to learn and retain information through time.

Using an L1 or L2 penalty on the recurrent weights can help with exploding gradients. Assuming the weights are initialized to small values, the largest singular value $\lambda_1$ of $W_{rec}$ is probably smaller than 1. The L1/L2 term can ensure that during training $\lambda_1$ stays smaller than 1, and in this regime gradients cannot explode. But this approach limits the model to a single point attractor at the origin, where any information inserted into the model dies out exponentially fast. It prevents the model from learning generator networks and from exhibiting long-term memory traces.
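A quick numerical illustration of that regime, using a toy linear recurrence $h_t = W_{rec} h_{t-1}$ rather than a real RNN cell (the sizes match the question: 128 units, 80 timesteps):

```python
import numpy as np

rng = np.random.default_rng(0)
W_rec = rng.normal(scale=0.03, size=(128, 128))   # small initialization, as in the argument

# Largest singular value of W_rec is well below 1 here.
print("lambda_1 =", np.linalg.svd(W_rec, compute_uv=False)[0])

h = rng.normal(size=128)   # "information inserted in the model" at t = 0
for _ in range(80):        # 80 timesteps, as in the question
    h = W_rec @ h

# The state norm has collapsed towards zero: nothing injected at t = 0 survives,
# which is why this regime cannot carry long-term memory traces.
print("||h_80|| =", np.linalg.norm(h))
```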