Neural Network – Fitting with More Parameters Than Observations

keras, machine learning, neural networks

I'm training a neural network for regression using keras with about 13k training observations, each with 40 features.

It's a Sequential model with Dense layers.
I generate random architectures for the hidden layers, i.e. a random number of hidden layers and a random number of nodes in each layer. The input and output layers are fixed and not random.
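A minimal sketch of this kind of setup (the layer-count and width ranges here are just illustrative, not my actual values):

```python
# Minimal sketch: Sequential regression model with a random number of
# Dense hidden layers and random layer widths. Ranges are illustrative.
import random
from tensorflow import keras
from tensorflow.keras import layers

n_features = 40  # fixed input dimension

def random_model(max_layers=5, max_units=1024):
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    for _ in range(random.randint(1, max_layers)):
        model.add(layers.Dense(random.randint(8, max_units), activation="relu"))
    model.add(layers.Dense(1))  # fixed single-output regression head
    model.compile(optimizer="adam", loss="mse")
    return model

model = random_model()
model.summary()  # prints Total / Trainable / Non-trainable params
```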

The models are fitted and their summaries printed. The model summary reports the Total params, Trainable params, and Non-trainable params, e.g.

Total params: 2,052,948

Trainable params: 2,052,948

Non-trainable params: 0

I am interpreting those as the weights and biases for the network.
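As I understand it, for a Dense layer the count is inputs × units weights plus units biases, so the totals add up quickly. A purely illustrative calculation with made-up layer sizes:

```python
# Illustrative only: how trainable parameters are counted for Dense layers.
# For each Dense layer: params = inputs * units (weights) + units (biases).
layer_sizes = [40, 1000, 1000, 1]  # hypothetical: 40 inputs, two hidden layers, 1 output

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    total += n_in * n_out + n_out  # weight matrix + bias vector
print(total)  # 1,043,001 for this made-up architecture -- millions come quickly
```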

What I have trouble understanding is that the summary reports millions of trainable params, which is far greater than the number of observations.

The loss decreases substantially, so the fitting appears to have succeeded.

How can the fitting calculate this many parameters when there are far fewer observations available?

Best Answer

This question has been bothering me for a while. I haven't seen any satisfying explanations so far, so here is what I think is going on.

Everything you say here is correct. The trainable parameters are the weights and the biases of the network. (If one is using trainable embedding layers, the embedding weights are also included in the number of trainable parameters.) The fitting does succeed: the validation loss decreases. And you are also correct in saying that the parameters cannot be determined using far fewer observations. I would just qualify this statement with one word: the parameters cannot be uniquely determined using far fewer observations. In fact, if you run the fitting again and again, you will get pretty much the same validation loss but very different weights, precisely because you don't have enough data to uniquely calculate them.
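To see this concretely, a sketch along these lines (with stand-in data and an assumed architecture, nothing from the original question) will typically give very similar validation losses but noticeably different weights across random seeds:

```python
# Sketch: refit the same architecture from different random initializations
# and compare validation loss vs. the learned weights. Data is synthetic.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(13000, 40).astype("float32")   # stand-in predictors
y = X @ np.random.rand(40, 1).astype("float32")   # stand-in target

def fit_once(seed):
    keras.utils.set_random_seed(seed)
    model = keras.Sequential([
        keras.Input(shape=(40,)),
        layers.Dense(1024, activation="relu"),
        layers.Dense(1024, activation="relu"),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    hist = model.fit(X, y, validation_split=0.2, epochs=5, verbose=0)
    return hist.history["val_loss"][-1], model.get_weights()

loss_a, w_a = fit_once(0)
loss_b, w_b = fit_once(1)
print(loss_a, loss_b)                    # typically close to each other
print(np.mean(np.abs(w_a[0] - w_b[0])))  # first-layer weights differ substantially
```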

The point is that the goal of the fitting is not to find the values of the parameters. Those values have no inherent meaning or value. The goal of the fitting is to predict the target variable(s) given the predictor variable(s) as well as possible, which translates into minimizing the loss. The goal is to label a picture of a dog with the label 'dog', and what those weights and biases are in that 100-layer network is pretty much irrelevant.

Contrast this with how fitting is done in the 'real' sciences (as opposed to data science :) ), such as physics or chemistry. There you'd have a theory that describes some natural phenomenon. The theory would have several parameters representing physical properties. You'd collect some experimental data and fit it with the theoretical predictions. Mathematically, you'd still be minimizing a loss function, but your goal would be the actual values of those parameters. The goal is to find the masses of those elementary particles, or conductivity, or flammability, or whatever.

As a side point, what is also surprising is that, to a great extent, DNNs can avoid overfitting even when the number of parameters is one or two orders of magnitude greater than the training data size. With so many 'extra' parameters, if you run the fitting long enough the training loss will go to zero. What's interesting and different about DNNs is that the fit still generalizes well enough to significantly reduce the validation loss. Necessary precautions should be taken, of course, such as dropout, regularization, etc.
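As a rough sketch of those precautions (illustrative values only), one might add dropout and L2 weight regularization, plus early stopping on the validation loss:

```python
# Sketch of the usual precautions mentioned above; all values are illustrative.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(40,)),
    layers.Dense(1024, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(1024, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Early stopping on validation loss is another common safeguard;
# pass callbacks=[early_stop] to model.fit(...).
early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
```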
