Solved – Final layer of neural network responsible for overfitting

boosting, gradient descent, neural networks

I am using a multi-layer perceptron with 2 hidden layers to solve a binary classification task on a noisy time series dataset with a class imbalance of 80/20. I have 30 million rows and 500 features in the training set. The dataset is structured, i.e., not images. My original features were highly right-skewed; I do my best to transform these into nicer distributions by either taking logs or categorising some of them. I use an architecture of 512->128->128->1, with ReLU activations in every layer except the last. My loss function is sigmoid cross entropy.

The validation set contains 10 million rows. Initially the validation error goes down, but then starts to go up again after a couple of epochs. On analysing the gradients and weights of each layer, I see that the overfitting coincides with the weights on the final layer only getting larger and larger. The final layer seems to go into overdrive while the rest of the network seems to do very little learning.
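One way to make this kind of diagnosis concrete is to log the norm of each layer's weight matrix at regular intervals and compare the growth per layer. The following is only a minimal NumPy sketch: the weights are random stand-ins shaped like the 512->128->128->1 architecture, and the "training" step is simulated by scaling the final layer, so none of the numbers come from the real model.

```python
import numpy as np

def layer_weight_norms(weights):
    """Return the Frobenius (L2) norm of each layer's weight matrix."""
    return [float(np.linalg.norm(w)) for w in weights]

# Hypothetical weights shaped like the 512 -> 128 -> 128 -> 1 network above.
rng = np.random.default_rng(0)
shapes = [(512, 128), (128, 128), (128, 1)]
weights = [rng.normal(scale=0.05, size=shape) for shape in shapes]

norms_before = layer_weight_norms(weights)

# Stand-in for a stretch of training in which only the final layer grows.
weights[-1] = weights[-1] * 3.0
norms_after = layer_weight_norms(weights)

# Growth ratio per layer; a ratio far above 1 flags the runaway layer.
growth = [after / before for before, after in zip(norms_before, norms_after)]
for i, ratio in enumerate(growth):
    print(f"layer {i}: weight norm grew by a factor of {ratio:.2f}")
```

In a real training loop the same function would be called on the actual weight matrices every few hundred steps, next to the validation loss.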

I can solve the overfitting problem by using l2 regularisation, but this hurts the validation error. I have yet to find a beta regularisation parameter which doesn't hurt the best validation error I've seen. Dropout makes things even worse.

Granted, the classification problem is very difficult, with probably a very weak signal, but I find that gradient boosted trees are able to generalize much better than a simple, say, 64×64 multi-layer perceptron (the log loss on the training set is the same for both network and gradient boosted tree).

Are there any words of wisdom on how to make this network generalize better given that I've already tried:

dropout of varying degrees
l1/l2/group lasso regularization
adding noise to inputs
adding noise to gradients and weights
feature-engineering so as to remove/re-represent highly skewed features
batch normalization
using a lower learning rate on the final layer
simply using a smaller network (this is the best solution I've found)

to some or all layers. All methods hurt the validation error so much that the performance is nowhere near how well the tree model does. I would have given up by now were it not for the fact that the tree model is able to do so much better out of sample, but the training log loss for both is the same.

Best Answer

With sample size $N=30\times10^6$ and 500 features, you have already tried (most of) the usual regularization tricks, so it doesn't look like there's much left to do at this point.

However, maybe the problem here is upstream. You haven't told us exactly what your dataset is (what are the observations? what are the features?) or what you are trying to classify. You also don't describe your architecture in detail (how many neurons do you have? which activation functions are you using? what rule do you use to convert the output layer result into a class choice?). I will proceed under the assumptions that:

you have 512 units in the input layer, 512 units in each of the hidden layers and 2 units in the output layer, corresponding to $p=525312$ weights (not counting biases). In this case, your data set seems large enough to learn all weights.
you're using One-Hot Encoding to perform classification.
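The parameter count in the first assumption can be checked with a few lines of arithmetic. This is a small sketch; `count_weights` is a hypothetical helper, and the second layout is the 512->128->128->1 architecture stated in the question, included only for comparison.

```python
def count_weights(layer_sizes, include_biases=False):
    """Count the parameters of a fully connected network given its layer sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out          # one weight per input/output pair
        if include_biases:
            total += n_out             # one bias per output unit
    return total

assumed_layout = [512, 512, 512, 2]    # layout assumed in this answer
question_layout = [512, 128, 128, 1]   # layout stated in the question

print(count_weights(assumed_layout))                       # 525312
print(count_weights(assumed_layout, include_biases=True))  # 526338
print(count_weights(question_layout))                      # 82048
```

Either way, the number of parameters is small compared with 30 million training rows.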

Correct me if my assumptions are wrong. Now:

if you have structured data (this means you're not doing image classification), maybe there's just nothing you can do. XGBoost usually beats DNNs on structured data classification. Have a look at the Kaggle competitions: you'll see that for structured data, the winning teams usually use ensembles of extreme gradient boosted trees, not Deep Neural Networks.
if you have unstructured data, then something's weird: usually DNNs dominate XGBoost here. If you're doing image classification, don't use an MLP. Almost everyone now uses a CNN. Also, be sure you don't use sigmoid activation functions in the hidden layers, but something such as ReLU.
You didn't try early stopping and learning rate decay. Early stopping usually "plays nice" with most other regularization methods and it's easy to implement, so that's the first thing I'd try if I were you. In case you're not familiar with early stopping, read this nice answer: Early stopping vs cross validation
If nothing else helps, you should check for errors in your code. Can you try to write unit tests? If you're using Tensorflow, Theano or MXNet, can you switch to a high-level API such as Keras or PyTorch? One might expect that using a high-level API, where less customization is possible, would drive your test error up, not down. However, often the opposite happens, because the higher-level API allows you to do the same work with much less code, and thus with much less opportunity for mistakes. At the very least, you can be sure your high test error isn't due to coding bugs.
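To make the early stopping and learning rate decay suggestions concrete, here is a minimal, framework-free sketch. The function names, the `patience` and `min_delta` parameters, the exponential decay schedule, and the validation losses are all illustrative assumptions, not something taken from the question.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve on the best value
    by more than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

def exponential_decay(initial_rate, decay_rate, epoch):
    """Learning rate after `epoch` epochs of exponential decay."""
    return initial_rate * decay_rate ** epoch

# Made-up validation curve: improves for a few epochs, then gets worse.
val_losses = [0.70, 0.62, 0.58, 0.59, 0.61, 0.64, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"restore weights from epoch {best_epoch}, stop at epoch {stop_epoch}")
print("learning rate at epoch 2:", exponential_decay(0.1, 0.5, 2))
```

In practice you would save a checkpoint whenever the validation loss improves and restore the one from `best_epoch` after stopping.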

Finally, I didn't add anything about dealing with class imbalance because you seem quite knowledgeable, so I assume you used the usual methods to deal with class imbalance. In case I'm wrong, let me know and I'll add a couple tricks, citing questions dealing specifically with class imbalance if needed.

Related Solutions

Solved – L2-norms of gradients increasing during training of deep neural network

I think I understand the problem with the gradient norm. The negative gradient gives a direction towards a local minimum, but it doesn't say how far away that minimum is. That is why you configure a step size. When your weights are close to the minimum, a constant step can be larger than necessary, so the update sometimes overshoots in the wrong direction, and in the next epoch the network tries to correct for that. The momentum algorithm modifies this approach: after each iteration it increases the weight update when the sign of the gradient stays the same (through an additional term that is added to the $\Delta w$ value). In vector terms, this addition can increase the magnitude of the update and change its direction as well, so you can miss the ideal step by even more. To correct for this, the network sometimes needs a larger gradient vector, because the minimum is a little further away than in the previous epoch.

To test that theory I built a small experiment. First of all, I reproduced the same behaviour with a simpler network architecture and fewer iterations.
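The overshooting effect can be seen on a one-dimensional toy problem. This is only an illustrative sketch: the loss $f(w) = 0.5\,w^2$, the step of 0.1 and the momentum coefficient of 0.9 are assumptions chosen for the demonstration, and the update is the classical momentum rule rather than any particular library's implementation.

```python
def gradient(w):
    # Gradient of the toy loss f(w) = 0.5 * w**2, whose minimum is at w = 0.
    return w

def run_plain_gd(w0, step, n_iter):
    """Plain gradient descent with a constant step."""
    w, path = w0, [w0]
    for _ in range(n_iter):
        w = w - step * gradient(w)
        path.append(w)
    return path

def run_momentum(w0, step, mu, n_iter):
    """Classical momentum: the previous update is added to the new one."""
    w, velocity, path = w0, 0.0, [w0]
    for _ in range(n_iter):
        velocity = mu * velocity - step * gradient(w)
        w = w + velocity
        path.append(w)
    return path

plain_path = run_plain_gd(1.0, step=0.1, n_iter=30)
momentum_path = run_momentum(1.0, step=0.1, mu=0.9, n_iter=30)

# A negative w means the iterate jumped past the minimum at w = 0.
print("plain GD crosses the minimum:", any(w < 0 for w in plain_path))
print("momentum crosses the minimum:", any(w < 0 for w in momentum_path))
```

With the same step, plain gradient descent approaches the minimum from one side, while the accumulated velocity carries the momentum iterate past it, after which the gradient changes sign and has to pull it back.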

import numpy as np
from numpy.linalg import norm
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from neupy import algorithms

plt.style.use('ggplot')

grad_norm = []
def train_epoch_end_signal(network):
    global grad_norm
    # Get gradient for the last layer
    grad_norm.append(norm(network.gradients[-1]))

data, target = make_regression(n_samples=10000, n_features=50, n_targets=1)

target_scaler = preprocessing.MinMaxScaler()
# make_regression returns a 1-D target; MinMaxScaler expects a 2-D array
target = target_scaler.fit_transform(target.reshape(-1, 1))

mnet = Pipeline([
    ('scaler', preprocessing.MinMaxScaler()),
    ('momentum', algorithms.Momentum(
        (50, 30, 1),
        step=1e-10,
        show_epoch=1,
        shuffle_data=True,
        verbose=False,
        train_epoch_end_signal=train_epoch_end_signal,
    )),
])

mnet.fit(data, target, momentum__epochs=100)

After training I plotted the gradient norms. Below you can see behaviour similar to yours.

plt.figure(figsize=(12, 8))
plt.plot(grad_norm)
plt.title("Momentum algorithm final layer gradient 2-Norm")
plt.ylabel("Gradient 2-Norm")
plt.xlabel("Epoch")
plt.show()

Also, if you look closer at the training results after each epoch, you will find that the errors vary as well.

plt.figure(figsize=(12, 8))
network = mnet.steps[-1][1]
network.plot_errors()
plt.show()

Next, using almost the same settings, I create another network, but this time I select the golden-section search algorithm for step selection on each epoch.

grad_norm = []
def train_epoch_end_signal(network):
    global grad_norm
    # Get gradient for the last layer
    grad_norm.append(norm(network.gradients[-1]))
    if network.epoch % 20 == 0:
        print("Epoch #{}: step = {}".format(network.epoch, network.step))

mnet = Pipeline([
    ('scaler', preprocessing.MinMaxScaler()),
    ('momentum', algorithms.Momentum(
        (50, 30, 1),
        step=1e-10,
        show_epoch=1,
        shuffle_data=True,
        verbose=False,
        train_epoch_end_signal=train_epoch_end_signal,
        optimizations=[algorithms.LinearSearch]
    )),
])

mnet.fit(data, target, momentum__epochs=100)

The output below shows how the step varies, printed every 20 epochs.

Epoch #0: step = 0.5278640466583575
Epoch #20: step = 1.103484809236065e-13
Epoch #40: step = 0.01315561773591515
Epoch #60: step = 0.018180616551587894
Epoch #80: step = 0.00547810271094794

And if you look closer at the results after that training, you will find that the variation in the 2-norm is much smaller.

plt.figure(figsize=(12, 8))
plt.plot(grad_norm)
plt.title("Momentum algorithm final layer gradient 2-Norm")
plt.ylabel("Gradient 2-Norm")
plt.xlabel("Epoch")
plt.show()

This optimization also reduces the variation of the errors.

plt.figure(figsize=(12, 8))
network = mnet.steps[-1][1]
network.plot_errors()
plt.show()

As you can see, the main problem with the gradient is the step length.

It's important to note that even with high variation your network can still improve its prediction accuracy after each iteration.
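For readers who want to see what choosing the step by a line search looks like, here is a minimal golden-section search along the negative gradient of a small quadratic. It is a generic sketch and not neupy's implementation; the quadratic, the search interval [0, 1] and the tolerance are assumptions made for the example.

```python
import math
import numpy as np

INV_PHI = (math.sqrt(5) - 1) / 2  # inverse golden ratio, about 0.618

def golden_section_search(phi, lo, hi, tol=1e-8):
    """Minimize a unimodal function phi on the interval [lo, hi]."""
    c = hi - INV_PHI * (hi - lo)
    d = lo + INV_PHI * (hi - lo)
    while abs(hi - lo) > tol:
        if phi(c) < phi(d):
            hi = d   # the minimum lies in [lo, d]
        else:
            lo = c   # the minimum lies in [c, hi]
        c = hi - INV_PHI * (hi - lo)
        d = lo + INV_PHI * (hi - lo)
    return (lo + hi) / 2

# Toy loss f(w) = 0.5 * w^T A w with very different curvature per axis.
A = np.diag([1.0, 10.0])
w = np.array([1.0, 1.0])
grad = A @ w

def loss_along_gradient(step):
    w_new = w - step * grad
    return 0.5 * w_new @ A @ w_new

best_step = golden_section_search(loss_along_gradient, 0.0, 1.0)
exact_step = (grad @ grad) / (grad @ A @ grad)  # closed form for a quadratic
print("golden-section step:", best_step)
print("exact optimal step: ", exact_step)
```

Because the step is re-optimized for the current gradient instead of being held constant, each update lands close to the minimum along its search direction, which fits the smaller variation seen in the second experiment.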

Solved – Which elements of a Neural Network can lead to overfitting

Increasing the number of hidden units and/or layers may lead to overfitting because it will make it easier for the neural network to memorize the training set, that is to learn a function that perfectly separates the training set but that does not generalize to unseen data.

Regarding the batch size: combined with the learning rate, the batch size determines how fast you learn (converge to a solution). Bad choices of these parameters usually lead to slow learning or an inability to converge to a solution, not to overfitting.

The number of epochs is the number of times you iterate over the whole training set. As a result, if your network has a large capacity (a lot of hidden units and hidden layers), the longer you train, the more likely you are to overfit. To address this issue you can use early stopping, which means training your neural network for as long as the error on an external validation set keeps decreasing, instead of for a fixed number of epochs.

In addition, to prevent overfitting overall you should use regularization. Some techniques include l1 or l2 regularization on the weights and/or dropout. It is better to have a neural network with more capacity than necessary and use regularization to prevent overfitting than to try to perfectly adjust the number of hidden units and layers.
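As a small illustration of what l2 regularization does to the weights, the sketch below fits a linear model by gradient descent with and without the penalty term. The data, the penalty strength `lam`, and the helper `fit_linear` are all made up for the example; a neural network would apply the same extra term to each weight matrix.

```python
import numpy as np

def fit_linear(X, y, lam, step=0.1, n_iter=500):
    """Gradient descent on half the mean squared error plus (lam / 2) * ||w||^2."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        data_grad = X.T @ (X @ w - y) / n_samples
        w = w - step * (data_grad + lam * w)  # the l2 term pulls w towards zero
    return w

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

w_plain = fit_linear(X, y, lam=0.0)
w_l2 = fit_linear(X, y, lam=1.0)
print("weight norm without l2:", np.linalg.norm(w_plain))
print("weight norm with l2:   ", np.linalg.norm(w_l2))
```

The penalized fit ends with a smaller weight norm, which is the mechanism that keeps individual layers from growing without bound.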
