Neural Network – Fitting with More Parameters Than Observations

keras, machine learning, neural networks

I'm training a neural network for regression using keras with about 13k training observations, each with 40 features.

It's a Sequential model with Dense layers.
I generate random architectures for the hidden layers, i.e. a random number of hidden layers and a random number of nodes in each layer. The input and output layers are fixed and not random.
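A minimal sketch of this kind of setup (the layer-count and width ranges here are just illustrative, not my actual values):

```python
# Minimal sketch: Sequential regression model with a random number of
# Dense hidden layers and random layer widths. Ranges are illustrative.
import random
from tensorflow import keras
from tensorflow.keras import layers

n_features = 40  # fixed input dimension

def random_model(max_layers=5, max_units=1024):
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    for _ in range(random.randint(1, max_layers)):
        model.add(layers.Dense(random.randint(8, max_units), activation="relu"))
    model.add(layers.Dense(1))  # fixed single-output regression head
    model.compile(optimizer="adam", loss="mse")
    return model

model = random_model()
model.summary()  # prints Total / Trainable / Non-trainable params
```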

The models are fitted and their summaries printed. The model summary reports the Total params, Trainable params, and Non-trainable params, e.g.

Total params: 2,052,948

Trainable params: 2,052,948

Non-trainable params: 0

I am interpreting those as the weights and biases for the network.
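As I understand it, for a Dense layer the count is inputs × units weights plus units biases, so the totals add up quickly. A purely illustrative calculation with made-up layer sizes:

```python
# Illustrative only: how trainable parameters are counted for Dense layers.
# For each Dense layer: params = inputs * units (weights) + units (biases).
layer_sizes = [40, 1000, 1000, 1]  # hypothetical: 40 inputs, two hidden layers, 1 output

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    total += n_in * n_out + n_out  # weight matrix + bias vector
print(total)  # 1,043,001 for this made-up architecture -- millions come quickly
```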

What I have trouble understanding is that the summary reports millions of trainable params, which is far greater than the number of observations.

The loss decreases substantially, so the fitting appears to have succeeded.

How can the fitting calculate this many parameters when there are far fewer observations available?

Best Answer

This question has been bothering me for a while. I haven't seen any satisfying explanations so far, so here is what I think is going on.

Everything you say here is correct. The trainable parameters are the weights and the biases of the network. (If one is using trainable embedding layers, the embedding weights are also included in the number of trainable parameters.) The fitting does succeed: the validation loss decreases. And you are also correct in saying that the parameters cannot be determined using far fewer observations. I would just qualify this statement with one word: the parameters cannot be uniquely determined using far fewer observations. In fact, if you run the fitting again and again, you will get pretty much the same validation loss but very different weights, precisely because you don't have enough data to uniquely calculate them.
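To see this concretely, a sketch along these lines (with stand-in data and an assumed architecture, nothing from the original question) will typically give very similar validation losses but noticeably different weights across random seeds:

```python
# Sketch: refit the same architecture from different random initializations
# and compare validation loss vs. the learned weights. Data is synthetic.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(13000, 40).astype("float32")   # stand-in predictors
y = X @ np.random.rand(40, 1).astype("float32")   # stand-in target

def fit_once(seed):
    keras.utils.set_random_seed(seed)
    model = keras.Sequential([
        keras.Input(shape=(40,)),
        layers.Dense(1024, activation="relu"),
        layers.Dense(1024, activation="relu"),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    hist = model.fit(X, y, validation_split=0.2, epochs=5, verbose=0)
    return hist.history["val_loss"][-1], model.get_weights()

loss_a, w_a = fit_once(0)
loss_b, w_b = fit_once(1)
print(loss_a, loss_b)                    # typically close to each other
print(np.mean(np.abs(w_a[0] - w_b[0])))  # first-layer weights differ substantially
```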

The point is that the goal of the fitting is not to find the values of the parameters. Those values have no inherent meaning or value. The goal of the fitting is to predict the target variable(s) given the predictor variable(s) as well as possible, which translates into minimizing the loss. The goal is to label a picture of a dog with the label 'dog', and what those weights and biases are in that 100-layer network is pretty much irrelevant.

Contrast this with how fitting is done in the 'real' sciences (as opposed to data science :) ), such as physics or chemistry. There you'd have a theory that describes some natural phenomenon. The theory would have several parameters representing physical properties. You'd collect some experimental data and fit it with the theoretical predictions. Mathematically, you'd still be minimizing a loss function, but your goal would be the actual values of those parameters. The goal is to find the masses of those elementary particles, or conductivity, or flammability, or whatever.

As a side point, what is also surprising is that, to a great extent, DNNs can avoid overfitting even when the number of parameters is one or two orders of magnitude greater than the training data size. With so many 'extra' parameters, if you run the fitting long enough the training loss will go to zero. What's interesting and different about DNNs is that the fit still generalizes well enough to significantly reduce the validation loss. Necessary precautions should be taken, of course, such as dropout, regularization, etc.
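As a rough sketch of those precautions (illustrative values only), one might add dropout and L2 weight regularization, plus early stopping on the validation loss:

```python
# Sketch of the usual precautions mentioned above; all values are illustrative.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(40,)),
    layers.Dense(1024, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(1024, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Early stopping on validation loss is another common safeguard;
# pass callbacks=[early_stop] to model.fit(...).
early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
```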
