Solved – Final layer of neural network responsible for overfitting

Tags: boosting, gradient descent, neural networks

I am using a multi-layer perceptron with 2 hidden layers to solve a binary classification task on a noisy time-series dataset with a class imbalance of 80/20. I have 30 million rows and 500 features in the training set. The dataset is structured, i.e., not images. My original features were highly right-skewed; I do my best to transform these into nicer distributions by either taking logs or categorising some of them. I use an architecture of 512->128->128->1, with ReLU activations in every layer except the last. My loss function is sigmoid cross entropy.
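For concreteness, here is a minimal PyTorch sketch of one reading of that setup: the 500 input features feed hidden widths of 512, 128 and 128 and a single output logit, with BCEWithLogitsLoss standing in for sigmoid cross entropy (adjust the widths if you read the 512->128->128->1 notation differently):

```python
import torch
import torch.nn as nn

# One reading of the architecture: 500 input features, hidden widths
# 512 -> 128 -> 128, a single output logit, ReLU everywhere except the output.
model = nn.Sequential(
    nn.Linear(500, 512), nn.ReLU(),
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),  # raw logit; the sigmoid lives inside the loss
)

# BCEWithLogitsLoss = sigmoid + binary cross entropy, i.e. sigmoid cross entropy.
criterion = nn.BCEWithLogitsLoss()
```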

The validation set contains 10 million rows. Initially the validation error goes down, but then starts to go up again after a couple of epochs. On analysing the gradients and weights of each layer, I see that the overfitting coincides with the weights on the final layer only getting larger and larger. The final layer seems to go into overdrive while the rest of the network seems to do very little learning.
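For reference, this is the kind of per-layer check I mean: a small sketch (it works for any PyTorch nn.Module, including the Sequential sketched above) that prints the L2 norm of each layer's weight matrix, which is where the final-layer blow-up shows up:

```python
def log_weight_norms(model):
    # Print the L2 norm of each weight matrix (biases skipped); call this
    # once per epoch to see which layer's weights are growing.
    for name, param in model.named_parameters():
        if param.dim() > 1:
            print(f"{name}: ||W|| = {param.detach().norm().item():.4f}")
```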

I can stop the overfitting by using L2 regularisation, but this hurts the validation error. I have yet to find a beta regularisation parameter (the coefficient on the L2 penalty) which doesn't hurt the best validation error I've seen. Dropout makes things even worse.
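For reference, a minimal sketch of that L2 penalty added to the loss (the beta value shown is only a placeholder):

```python
def loss_with_l2(criterion, logits, targets, model, beta=1e-4):
    # Sigmoid cross entropy plus beta * (sum of squared weights);
    # biases are left out of the penalty, and beta is a placeholder value.
    base = criterion(logits, targets)
    l2 = sum(p.pow(2).sum() for p in model.parameters() if p.dim() > 1)
    return base + beta * l2
```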

Granted, the classification problem is very difficult, probably with a very weak signal, but I find that gradient boosted trees are able to generalize much better than even a simple, say, 64×64 multi-layer perceptron (the log loss on the training set is the same for both the network and the gradient boosted tree).

Are there any words of wisdom on how to make this network generalize better, given that I've already tried the following (applied to some or all layers):

  • dropout of varying degrees
  • l1/l2/group lasso regularization
  • adding noise to inputs
  • adding noise to gradients and weights
  • feature-engineering so as to remove/re-represent highly skewed features
  • batch normalization
  • using a lower learning rate on the final layer (a sketch of this follows below)
  • simply using a smaller network (this is the best solution I've found)

All of these hurt the validation error so much that performance is nowhere near the tree model's. I would have given up by now were it not for the fact that the tree model is able to do so much better out of sample, even though the training log loss for both is the same.
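On the lower learning rate for the final layer, a minimal PyTorch sketch using optimizer parameter groups (the learning-rate values are placeholders; `model` is the Sequential sketched earlier, and any model whose last module is the output Linear works the same way):

```python
import torch

# Give the output layer its own, smaller learning rate via parameter groups.
final_layer = model[-1]
final_ids = {id(p) for p in final_layer.parameters()}
base_params = [p for p in model.parameters() if id(p) not in final_ids]

optimizer = torch.optim.Adam([
    {"params": base_params, "lr": 1e-3},                    # placeholder base LR
    {"params": list(final_layer.parameters()), "lr": 1e-4}  # 10x smaller on the final layer
])
```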

Best Answer

With sample size $N=30\times10^6$ and 500 features, you have already tried (most of) the usual regularization tricks, so it doesn't look like there's much left to do at this point.

However, maybe the problem here is upstream. You haven't told us exactly what your dataset is (what are the observations? what are the features?) or what you are trying to classify. You also don't describe your architecture in detail (how many neurons do you have? which activation functions are you using? what rule do you use to convert the output-layer result into a class choice?). I will proceed under the assumptions that:

  • you have 512 units in the input layer, 512 units in each of the two hidden layers and 2 units in the output layer, corresponding to $p = 512\times512 + 512\times512 + 512\times2 = 525312$ weights (ignoring biases). In this case, your data set seems large enough to learn all of them.
  • you're using one-hot encoding of the two classes to perform the classification.

Correct me if my assumptions are wrong. Now:

  1. if you have structured data (meaning you're not doing image classification), maybe there's just nothing you can do. XGBoost usually just beats DNNs on structured-data classification. Have a look at Kaggle competitions: you'll see that for structured data the winning teams usually use ensembles of extreme gradient boosted trees, not deep neural networks.
  2. if you have unstructured data, then something's weird: DNNs usually dominate XGBoost here. If you're doing image classification, don't use an MLP; almost everyone now uses a CNN. Also, be sure you're not using sigmoid activation functions in the hidden layers, but something such as ReLU.
  3. You didn't mention trying early stopping or learning rate decay. Early stopping usually "plays nice" with most other regularization methods and it's easy to implement (a minimal sketch follows this list), so that's the first thing I'd try if I were you. In case you're not familiar with early stopping, read this nice answer: Early stopping vs cross validation
  4. If nothing else helps, you should check for errors in your code. Can you try to write unit tests? If you're using TensorFlow, Theano or MXNet, can you switch to a high-level API such as Keras or PyTorch? One might expect that using a high-level API, where less customization is possible, would drive your test error up, not down. However, the opposite often happens, because a higher-level API lets you do the same work with much less code, and thus with far fewer opportunities for mistakes. At the very least, you can be sure your high test error isn't due to coding bugs.
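Regarding point 3, here is a minimal sketch of early stopping in PyTorch (`train_one_epoch` and `evaluate` are hypothetical placeholders for your own training and validation-loss routines):

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, evaluate, max_epochs=100, patience=5):
    # Stop when the validation loss hasn't improved for `patience` epochs
    # and restore the best weights seen so far.
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)      # placeholder: one pass over the training set
        val_loss = evaluate(model)  # placeholder: returns the validation loss
        if val_loss < best_loss:
            best_loss, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```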

Finally, I didn't add anything about dealing with class imbalance because you seem quite knowledgeable, so I assume you used the usual methods to deal with it. In case I'm wrong, let me know and I'll add a couple of tricks, citing questions that deal specifically with class imbalance if needed.
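For completeness, one of those usual methods is to re-weight the minority class in the loss; a minimal PyTorch sketch, assuming the 20% class in the question's 80/20 split is the positive label:

```python
import torch
import torch.nn as nn

# With an 80/20 negative/positive split, weight positive examples by 80/20 = 4
# so both classes contribute comparably to the sigmoid cross-entropy loss.
pos_weight = torch.tensor([80.0 / 20.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```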