Solved – Using K-fold cross-validation as a hint for the number of neurons in the hidden layer. Possible flaws

cross-validation, neural networks, parameterization, regression

I have a dataset of 14 continuous variables and would like to predict one (say y) using all the remaining ones (say X).

I decided to use a neural network with a linear output (since I am doing regression). After a lot of reading and some tests, it turned out that a single hidden layer with about 6 or 7 neurons is a good enough configuration. However, I would like a better approach for selecting the network configuration than guessing configurations and testing them again and again.

Say the net has a single hidden layer and we need to choose only the number of neurons.

I saw a question on Stack Overflow (which I cannot find anymore) where a user suggested plotting the training error and the test error against the number of neurons in the hidden layer, and then choosing the number of neurons that minimizes the test error while keeping an eye on overfitting.
I'm trying to implement this and would like to be careful not to make silly mistakes. So far, I have done the following:

# Cross-validated test error for 1 to 13 hidden neurons
test.error <- numeric(13)

for (i in 1:13)
{
    # Estimate the test error by 10-fold cross-validation
    test.error[i] <- crossvalidate(data_, hidden_l = c(i))
}

The crossvalidate function takes the dataset and cross-validates the neural network with i hidden neurons using 10-fold cross-validation, returning the cross-validation score (MSE).
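For reference, a minimal sketch of what such a crossvalidate function could look like (the neuralnet package, the explicit formula f, and the response name y are illustrative assumptions rather than my exact code):

library(neuralnet)

# Hypothetical helper: 10-fold cross-validation returning the mean fold MSE.
# neuralnet needs an explicit formula, so build it from the column names.
f <- as.formula(paste("y ~", paste(setdiff(names(data_), "y"), collapse = " + ")))

crossvalidate <- function(data_, hidden_l = c(5), k = 10)
{
    folds <- sample(rep(1:k, length.out = nrow(data_)))    # random fold labels
    cv.error <- numeric(k)
    for (j in 1:k) {
        train <- data_[folds != j, ]
        test  <- data_[folds == j, ]
        nn   <- neuralnet(f, data = train, hidden = hidden_l, linear.output = TRUE)
        pred <- compute(nn, test[, setdiff(names(test), "y")])$net.result
        cv.error[j] <- mean((test$y - pred)^2)             # fold MSE
    }
    mean(cv.error)                                         # 10-fold CV score
}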

Then, plotting the errors, I get the following plot:

[Plot: training error (red) and cross-validated test error (green) against the number of hidden neurons]

The red line is the training error while the green one is the cross-validated test error. Now my question:

Is this approach a good rule of thumb? Meaning: could I use this to get a hint of what configurations might work and then do further tests?

I'm having a hard time figuring out possible flaws, since the cross-validation step should rule out part of the bias due to the train-test split procedure. However, I expected a more curved test-error line (a parabola or a similar U-shaped curve, like the training error line). Do you see any flaws in this approach?
Furthermore, in this example I wouldn't know how to choose between 8 and 10 neurons, and I can't really explain the spike at 9.

Best Answer

What you are doing is not wrong, but you can do much better. Instead of doing a single cross-validation run, you should repeat it many times with different splits of the data (but the same number of folds). This will make your curve smoother.
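Something along these lines (assuming your crossvalidate shuffles its folds internally, as sketched in the question; the repeat count and names are illustrative):

n.repeats <- 20
test.error <- matrix(NA, nrow = n.repeats, ncol = 13)

for (r in 1:n.repeats) {
    set.seed(r)                       # a different 10-fold split on each repeat
    for (i in 1:13) {
        test.error[r, i] <- crossvalidate(data_, hidden_l = c(i))
    }
}

# Averaging over repeats smooths the curve over the number of neurons
mean.test.error <- colMeans(test.error)
plot(1:13, mean.test.error, type = "b",
     xlab = "hidden neurons", ylab = "repeated 10-fold CV MSE")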

The next question is how many folds to use. Usually 5- or 10-fold cross-validation works best here. I would use 10-fold, because the optimal number of neurons for 80% of the data could be less than the optimal number for 100% of it. This is also true for 90%, but a bit less likely. Bootstrapping would also be an alternative to cross-validation.
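If you go the bootstrap route, a rough sketch of an out-of-bag error estimate could look like this (reusing the hypothetical formula f and neuralnet fit from above; the number of resamples is arbitrary):

bootstrap.error <- function(data_, hidden_l = c(5), B = 50)
{
    oob.error <- numeric(B)
    for (b in 1:B) {
        idx   <- sample(nrow(data_), replace = TRUE)        # bootstrap sample of rows
        train <- data_[idx, ]
        oob   <- data_[-unique(idx), ]                       # out-of-bag observations
        nn    <- neuralnet(f, data = train, hidden = hidden_l, linear.output = TRUE)
        pred  <- compute(nn, oob[, setdiff(names(oob), "y")])$net.result
        oob.error[b] <- mean((oob$y - pred)^2)
    }
    mean(oob.error)                                          # average out-of-bag MSE
}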

It is also important not to take the test.error shown above as an estimate of the model's performance, since you optimized a hyperparameter by cross-validation. If you want a performance estimate, you need an outer cross-validation loop as well.
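A rough sketch of such a nested (outer) loop, again reusing the hypothetical crossvalidate, formula f and neuralnet fit; the outer fold count is illustrative:

k.outer <- 5
outer.folds <- sample(rep(1:k.outer, length.out = nrow(data_)))
outer.error <- numeric(k.outer)

for (j in 1:k.outer) {
    inner <- data_[outer.folds != j, ]
    held  <- data_[outer.folds == j, ]

    # Inner loop: pick the number of neurons by 10-fold CV on the inner data only
    inner.error <- sapply(1:13, function(i) crossvalidate(inner, hidden_l = c(i)))
    best.h <- which.min(inner.error)

    # Refit with the selected size on all inner data and score on the held-out fold
    nn   <- neuralnet(f, data = inner, hidden = best.h, linear.output = TRUE)
    pred <- compute(nn, held[, setdiff(names(held), "y")])$net.result
    outer.error[j] <- mean((held$y - pred)^2)
}

mean(outer.error)    # performance estimate that accounts for the tuning step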