Solved – Under what circumstances does increasing the number of parameters increase the training loss?

machine learning

Using a convolutional neural network for image segmentation, I notice that if I increase the number of filters in each convolutional layer, the training loss increases in the short term (in the long term, the larger model's training loss might eventually drop below that of the smaller model).

See figure: same data and hyperparameters for every run; the only change is the number of convolutional kernels/filters per layer, ranging from 2.6 million trainable parameters (2.6M, black) down to 15 thousand trainable parameters (15K, red).

Is this something to be expected? Intuitively I would have expected the opposite: The more parameters, the higher the model complexity and the faster it can memorize/overfit the training data, thus decreasing the training loss rather than increasing it.

Could this behavior be because a more complex model takes longer to train, so that during the first epochs it performs worse than a smaller model?

From the figure, two things are very noticeable:

  1. The smaller the model, the lower the starting loss. This is consistent with the intuition that a more complex model has a more 'difficult' starting point, since it has to search for a solution in a higher-dimensional space. However, the loss curves do not suggest that the larger models will eventually overtake the smaller ones; they already seem to be plateauing at fairly high values.
  2. The larger the model, the more abrupt the initial decay and the earlier the plateau. This suggests that the more parameters the model has, the faster it settles into a local minimum. This could be the case if the extremely high-dimensional loss landscape becomes so complex that it contains many more local-minimum pits.

The model is a ~20-layer 3D CNN with 3x3x3 kernels, batch normalization on every layer, and LeakyReLU activations on every layer. It classifies input volumes of size ~25x25x25 into one of two classes (0, 1). The learning rate is 1e-05, using the Adam optimizer, with almost no regularization (no dropout, L2 ~1e-04).
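
For concreteness, a minimal sketch of this kind of setup is below. The question does not name a framework, so PyTorch is assumed; the widths, input channels, and the pooling/classifier head are illustrative placeholders rather than the exact model, and the L2 term is represented via Adam's `weight_decay` since the question does not say how it was applied.

```python
import torch
import torch.nn as nn

def make_3d_cnn(width: int, depth: int = 20, num_classes: int = 2) -> nn.Sequential:
    """Stack of Conv3d(3x3x3) -> BatchNorm3d -> LeakyReLU blocks plus a small classifier head."""
    layers, in_ch = [], 1
    for _ in range(depth):
        layers += [
            nn.Conv3d(in_ch, width, kernel_size=3, padding=1),
            nn.BatchNorm3d(width),
            nn.LeakyReLU(),
        ]
        in_ch = width
    layers += [nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(width, num_classes)]
    return nn.Sequential(*layers)

# The only difference between the compared runs is the number of filters per layer.
for width in (4, 16, 64):
    model = make_3d_cnn(width)
    print(width, "filters/layer:", sum(p.numel() for p in model.parameters()), "trainable parameters")

# Same optimizer settings for every run: Adam, lr=1e-5, L2 via weight_decay=1e-4.
optimizer = torch.optim.Adam(make_3d_cnn(64).parameters(), lr=1e-5, weight_decay=1e-4)
```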

Best Answer

After some tests with toy examples I have at least one answer to this question: regularization that depends on the parameters! Specifically the L2 penalty (though an L1 penalty, or any other norm on the weights, would have the same effect). These penalties are summed over the weights and are NOT normalized by the total number of parameters, so at a comparable weight scale a bigger model carries a larger penalty term and therefore a higher reported loss/cost than a smaller model.
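
A rough numerical sketch of this argument (PyTorch assumed, plain convolutional stacks standing in for the real model): because the L2 term is a sum of squared weights over all parameters, it grows with parameter count at comparable weight scales, which inflates the bigger model's loss from the very first epoch.

```python
import torch
import torch.nn as nn

def l2_penalty(model: nn.Module, weight_decay: float = 1e-4) -> float:
    # Sum-of-squares penalty over *all* parameters; note there is no division by parameter count.
    return weight_decay * sum((p ** 2).sum().item() for p in model.parameters())

def conv_stack(width: int, depth: int = 20) -> nn.Sequential:
    # Toy stand-in: a stack of 3x3x3 convolutions of the given width.
    layers, in_ch = [], 1
    for _ in range(depth):
        layers += [nn.Conv3d(in_ch, width, kernel_size=3, padding=1)]
        in_ch = width
    return nn.Sequential(*layers)

for width in (4, 16, 64):
    model = conv_stack(width)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"width={width:3d}  params={n_params:>9,}  L2 term at init = {l2_penalty(model):.4f}")
```

Running this shows the penalty term at initialization rising with the width, before any data has been seen, which matches the higher starting loss of the larger models in the figure.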

See: L2 Regularization Constant