Solved – a sensible order for parameter tuning in neural networks

hyperparameter, neural networks, optimization

There are so many aspects one could possibly change in a deep neural network that it is generally not feasible to do a grid search over all of them (e.g. activation function, layer type, number of neurons, number of layers, optimizer type, optimizer hyperparameters, etc.). Even if it were feasible, it might not be desirable: it amounts to comparing a huge number of models, which makes me think the 'best' out-of-sample performance estimate could be merely incidental.

From what I understand, it is therefore common to adjust the network in a greedy fashion, updating one or a few of the many components at a time.

Hence, when tuning the various components of a neural network by hand, what is considered to be a sensible order?

(For example, changing the activation function at a late stage is maybe not a good idea.)


This question also appears to be related, but its accepted answer is not really about the order in which to optimize hyperparameters; rather, it is about how to train a predefined set of models with different architectures.

Best Answer

There are so many aspects one could possibly change in a deep neural network that it is generally not feasible to do a grid search over all of them.

True. But an alternative, namely random search, is feasible in many cases. Have a look at this post for an interesting explanation.
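To make the idea concrete, here is a minimal random-search sketch in plain Python. The search space, the parameter names, and the stand-in scoring function are all illustrative assumptions, not part of any particular library; in practice `evaluate` would train the network and return a validation metric.

```python
import random

# Hypothetical search space; names, ranges, and choices are illustrative only.
SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(-5, -1),   # log-uniform draw
    "batch_size":    lambda: random.choice([16, 32, 64, 128]),
    "hidden_units":  lambda: random.choice([64, 128, 256]),
    "dropout":       lambda: random.uniform(0.0, 0.5),
}

def sample_config():
    """Draw one random configuration from the search space."""
    return {name: draw() for name, draw in SPACE.items()}

def random_search(evaluate, n_trials=20, seed=0):
    """Evaluate n_trials random configurations and return the best one."""
    random.seed(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config()
        score = evaluate(cfg)  # e.g. validation accuracy (higher is better)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Stand-in objective for demonstration only; it rewards learning rates
# near 1e-3. Replace with a real train-and-validate run.
def fake_validation_score(cfg):
    return -abs(cfg["learning_rate"] - 1e-3) - 0.001 * cfg["dropout"]

best, score = random_search(fake_validation_score, n_trials=50)
```

The appeal of random search over grid search is that it spends no budget repeating values of unimportant hyperparameters, so the important ones get covered more densely for the same number of trials.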

Hence, when tuning the various components of a neural network by hand, what is considered to be a sensible order?

The hyper-parameters interact, but for practical purposes they can be tuned independently, as these interactions have no apparent structure. So the order in which the hyper-parameters are tuned is largely subjective. One recommendation from this paper, which I do follow, is to tune the learning rate first; that saves a lot of experimentation. For an illustration of the importance of the learning rate, have a look at the image taken from the linked paper. The authors experimented with different variants of LSTM over three datasets and measured performance on the test set. The chart shows what fraction of the test-set performance variance can be attributed to each hyper-parameter.

They also show that the optimal value of the learning rate depends on the dataset.

[Figure from the linked paper: fraction of test-set performance variance attributable to each hyper-parameter, per LSTM variant and dataset]
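Since the learning rate matters most and its optimum is dataset-dependent, a cheap first step is a coarse sweep over log-spaced learning rates. This is a sketch under assumptions: `train_and_eval` stands in for a short training run returning validation loss, and the range and grid size are illustrative.

```python
import numpy as np

def lr_sweep(train_and_eval, low=1e-5, high=1e-1, n=7):
    """Run a short training job at each of n log-spaced learning rates
    and return the rate with the lowest validation loss."""
    lrs = np.logspace(np.log10(low), np.log10(high), n)
    losses = [train_and_eval(lr) for lr in lrs]
    best = lrs[int(np.argmin(losses))]
    return best, dict(zip(lrs, losses))

# Toy stand-in for demonstration: a loss quadratic in log10(lr),
# minimised near lr = 1e-3. Replace with real training.
toy_loss = lambda lr: (np.log10(lr) + 3) ** 2

best_lr, curve = lr_sweep(toy_loss)
```

Once the coarse sweep identifies the right order of magnitude, a finer sweep around the winner is usually enough; learning rates are conventionally searched on a log scale because their effect spans orders of magnitude.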

So, if I had to give the order that I follow for tuning neural networks, I would stick with this:

a) Optimizer

b) Learning rate

c) Batch size

d) Input noise

e) Network design - number of hidden layers and number of neurons

f) Regularizers - (L1, L2, dropout etc.)
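The order a)–f) above can be sketched as a coordinate-wise greedy search: fix all hyper-parameters, then sweep them one at a time in that order, keeping the best value of each before moving on. The candidate values and the toy scoring function below are illustrative assumptions; `evaluate` would be a real train-and-validate run.

```python
# Tuning order following a)-f) above; candidate values are illustrative.
ORDER = [
    ("optimizer",     ["sgd", "adam", "rmsprop"]),
    ("learning_rate", [1e-4, 1e-3, 1e-2]),
    ("batch_size",    [32, 64, 128]),
    ("input_noise",   [0.0, 0.05, 0.1]),
    ("hidden_layers", [1, 2, 3]),
    ("weight_decay",  [0.0, 1e-5, 1e-4]),   # stand-in for the regularizers
]

def greedy_tune(evaluate):
    """Tune one hyper-parameter at a time, in order, keeping the best value."""
    cfg = {name: choices[0] for name, choices in ORDER}  # initial defaults
    for name, choices in ORDER:
        scores = {}
        for value in choices:
            trial = {**cfg, name: value}
            scores[value] = evaluate(trial)  # higher is better
        cfg[name] = max(scores, key=scores.get)  # freeze the winner, move on
    return cfg

# Toy scoring function for demonstration only: it prefers adam, lr=1e-3,
# batch size 64, no noise, 2 hidden layers, and no weight decay.
def toy_eval(cfg):
    return (
        (cfg["optimizer"] == "adam")
        + (cfg["learning_rate"] == 1e-3)
        + (cfg["batch_size"] == 64)
        - cfg["input_noise"]
        - abs(cfg["hidden_layers"] - 2)
        - cfg["weight_decay"] * 1000
    )

best_cfg = greedy_tune(toy_eval)
```

This greedy scheme ignores interactions between hyper-parameters, which, as noted above, is usually an acceptable practical approximation; revisiting the learning rate after changing the network design is a common exception.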

But, again, every dataset is different, and the hyper-parameters will surely depend on it, so no single recipe will work for every problem. Plotting the error gives a feel for the dataset and helps in finding the 'optimal' hyper-parameters.

Some posts that might be useful:

a) In what order should we tune hyperparameters in Neural Networks?

b) Hyperparameter tuning for machine learning models.

c) A Conceptual Explanation of Bayesian Hyperparameter Optimization for Machine Learning

d) Hyper parameters tuning: Random search vs Bayesian optimization

e) hyperparameter tuning in neural networks