Solved – Alternatives to L1, L2 and Dropout regularization

deep learning, dropout, machine learning, neural networks, regularization

I have the following setup for a finance/machine learning research project at my university:
I'm applying a (deep) neural network (MLP) with the following structure in Keras/Theano to distinguish outperforming stocks (label 1) from underperforming stocks (label 0). For now I just use current and historical valuation multiples as features. Because it is stock data, one can expect very noisy data. Furthermore, a stable out-of-sample accuracy above 52% can already be considered good in this domain.

The structure of the network:

  • Dense layer with 30 features as input
  • ReLU activation
  • Batch normalization layer (without it, the network sometimes does not converge at all)
  • Optional dropout layer
  • Dense layer
  • ReLU activation
  • Batch normalization layer
  • Dropout layer
  • … further layers with the same structure
  • Dense layer with sigmoid activation

Optimizer: RMSprop

Loss-Function: Binary Cross-Entropy

The only preprocessing I do is rescaling the features to the [0,1] range.
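That rescaling step can be sketched with scikit-learn's `MinMaxScaler`; the array names here are placeholders, and the scaler is fit on the training set only so that no test-set statistics leak into the preprocessing:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder data standing in for the 30 valuation-multiple features
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 30))
X_test = rng.normal(size=(1000, 30))

# Fit on the training set only, then apply the same mapping to both sets
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # may fall slightly outside [0, 1]
```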

Now I'm running into a typical overfitting/underfitting problem, which I would normally tackle with dropout and/or L1/L2 kernel regularization. But in this case both dropout and L1/L2 regularization hurt performance, as you can see in the following charts.

My basic setup is: a 5-layer NN (incl. input and output layers), 60 neurons per layer, a learning rate of 0.02, no L1/L2 and no dropout, 100 epochs, batch normalization, batch size 1000. Everything is trained on 76,000 input samples (nearly balanced classes, 45%/55%) and applied to approximately the same number of test samples.
For the charts I changed only one parameter at a time. "Perf-Diff" means the average stock performance difference between stocks classified as 1 and stocks classified as 0, which is basically the core metric in the end (higher is better).
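The setup above can be sketched roughly as follows. This is a minimal reconstruction, assuming the `tensorflow.keras` API (the question used Keras with the Theano backend); the layer sizes, learning rate, loss, optimizer, and the Dense → ReLU → BatchNorm → Dropout ordering are taken from the question, everything else is a placeholder:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

def build_model(n_features=30, n_hidden_layers=3, units=60,
                learning_rate=0.02, dropout_rate=0.0):
    """5-layer MLP (input + 3 hidden + output) as described in the question."""
    model = models.Sequential()
    model.add(tf.keras.Input(shape=(n_features,)))
    for _ in range(n_hidden_layers):
        model.add(layers.Dense(units))
        model.add(layers.Activation("relu"))
        model.add(layers.BatchNormalization())
        if dropout_rate > 0:
            model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer=optimizers.RMSprop(learning_rate=learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Training would then be something like `model.fit(X_train, y_train, epochs=100, batch_size=1000)`.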

[Chart: regularization results]
In the L1 case the network basically classifies every sample into one class. The spike occurs because the network does the same thing again but happens to classify 25 samples correctly by chance. So this spike should be interpreted as an outlier, not as a good result.

The other parameters have the following impact:
[Chart: impact of the other parameters]

Do you have any ideas how I could improve my results? Is there an obvious error I'm making, or is there a simple explanation for the regularization results? Would you suggest doing any kind of feature selection before training (e.g. PCA)?
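For reference, if PCA were tried, the usual pattern mirrors the scaling step: fit it on the (scaled) training set only and reuse the same projection on the test set. A sketch with placeholder arrays, using scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_train = rng.normal(size=(1000, 30))   # placeholder for the scaled features
X_test = rng.normal(size=(1000, 30))

# Keep the smallest number of components explaining 95% of training variance
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)      # same projection, no refitting
```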

Edit: Further Parameters:
[Chart: further parameters]

Best Answer

Given that it's financial data, it's likely that the feature distributions in your train and validation sets differ - a phenomenon known as covariate shift - and neural networks don't tend to handle this well. Different feature distributions can cause overfitting even if the network is relatively small.
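One cheap way to check for this is a per-feature two-sample Kolmogorov–Smirnov test between the train and validation sets; a small p-value flags a feature whose distribution has shifted. A sketch on synthetic data (the shift in feature 0 is simulated for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
X_train = rng.normal(0.0, 1.0, size=(5000, 3))
X_val = rng.normal(0.0, 1.0, size=(5000, 3))
X_val[:, 0] += 0.5  # simulate covariate shift in the first feature

# Flag features whose train/validation distributions differ significantly
for j in range(X_train.shape[1]):
    stat, p = ks_2samp(X_train[:, j], X_val[:, j])
    if p < 0.01:
        print(f"feature {j}: possible covariate shift (KS={stat:.3f}, p={p:.1e})")
```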

Given that L1 and L2 don't help, I suspect other standard regularization measures like adding noise to the inputs/weights/gradients probably won't help either, but they might be worth a try.
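For completeness, input noise is the easiest of those to try - in Keras it would be a `GaussianNoise` layer after the input - but the idea can be shown with plain NumPy data augmentation (the function name and noise level are illustrative; noise is added at training time only):

```python
import numpy as np

def add_input_noise(X, stddev=0.05, seed=None):
    """Return a copy of X with zero-mean Gaussian noise added (train time only)."""
    rng = np.random.default_rng(seed)
    return X + rng.normal(0.0, stddev, size=X.shape)

X = np.zeros((4, 3))
X_noisy = add_input_noise(X, stddev=0.05, seed=0)
```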

I would be tempted to try a classification algorithm that is less affected by the absolute magnitudes of the features, such as gradient boosted trees.
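A sketch of that suggestion with scikit-learn on synthetic data (trees are invariant to monotone rescalings of each feature, so the [0,1] scaling step is not even needed; the data-generating process here is made up purely for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the 30 features and a noisy binary label
X = rng.normal(size=(2000, 30))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 1.0, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```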
