Training Set Increase and Its Impact on Accuracy in Neural Networks

adam, neural-networks, predictive-models, rbf-network

I have ~16000 labeled samples. I split them into ~8000 for training and ~8000 for testing of my RBF neural network, find the best hyperparameters (RMSE from 1.2 to 1.4*), and finally train the model on the whole set of ~16000 labeled samples. When I then run these 16000 samples through the final model, I get a much worse result (RMSE from 2.0 to 2.3*). How is this possible?

My RBF network has 350 neurons in the input layer, 2 in the output layer, and two hidden layers with 50 and 30 neurons respectively. I use a Gaussian distribution for weight initialization, sigmoid as the activation function in the hidden layers and ReLU in the output layer, Adam as the optimizer, and L2 as the loss function.

* depending on the random seed used for the train-test split and weight initialization

EDIT: I use batches of 400 samples and 500 epochs, so I have 10,000 training iterations when I use 8k samples for training and 20,000 when I use the whole dataset.
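
For reference, a minimal sketch of the setup described above, assuming a plain feed-forward implementation in PyTorch (layer sizes, activations, optimizer, loss, batch size and epoch count follow the description; the initialization scale and the placeholder data are assumptions):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Layer sizes and activations as described in the question.
model = nn.Sequential(
    nn.Linear(350, 50), nn.Sigmoid(),   # first hidden layer
    nn.Linear(50, 30), nn.Sigmoid(),    # second hidden layer
    nn.Linear(30, 2), nn.ReLU(),        # output layer with ReLU
)

# Gaussian weight initialization (the std of 0.05 is an assumed value).
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.normal_(layer.weight, mean=0.0, std=0.05)
        nn.init.zeros_(layer.bias)

criterion = nn.MSELoss()                          # "L2" loss read as mean squared error
optimizer = torch.optim.Adam(model.parameters())  # default learning rate

# Placeholder tensors standing in for the ~8000 training samples.
X, y = torch.rand(8000, 350), torch.rand(8000, 2)
loader = DataLoader(TensorDataset(X, y), batch_size=400, shuffle=True)

for epoch in range(500):          # 500 epochs
    for xb, yb in loader:         # batches of 400 samples
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```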

Best Answer

There are several reasons this could happen. One of them is increased variance in the data (more outliers, or a distributional shift, after incorporating the other half of the dataset); you might want to employ outlier detection to find and remove those examples from your dataset.
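
For example, a quick outlier screen over the full dataset could look like the following sketch, using scikit-learn's IsolationForest; the contamination rate and the random arrays standing in for your data are assumptions you would replace and tune:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Placeholder arrays standing in for your labeled data (replace with the real samples).
X = np.random.rand(2000, 350)   # features
y = np.random.rand(2000, 2)     # targets

# Flag roughly 1% of the samples as outliers (the contamination rate is an assumption to tune).
iso = IsolationForest(contamination=0.01, random_state=0)
mask = iso.fit_predict(X) == 1   # +1 = inlier, -1 = outlier

X_clean, y_clean = X[mask], y[mask]
print(f"kept {mask.sum()} of {len(mask)} samples")
```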

Also, running the train/test split once (a validation scheme usually called holdout) is often not a good measure of the actual performance of a model (it may lead to unstable results, both in hyperparameters and accuracy), especially with non-deterministic training procedures like neural networks, which are very sensitive to initialization (this may also be the culprit of your problem).

Therefore I'd suggest you try more robust validation schemes; at the very least this will give you a good grasp of the actual performance of your model:

  1. Holdout with a 50% split may not be enough for your model to generalize properly. When validating through holdout, it's usually recommended to use more data for training than for testing; 80/20, 75/25 and even 60/40 splits are far more common.
  2. Instead of holdout, use a proper k-fold cross-validation scheme. You can start with 10-fold cross-validation: divide the data into 10 groups, train from scratch 10 times, each time using one group for testing and the other nine for training, and average the performance over all runs (a minimal sketch follows this list). Since I don't know the task, I can't propose something tailored to your situation; you might need nested or stratified cross-validation.
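
Here is the minimal sketch referred to in point 2, using scikit-learn; MLPRegressor is only a stand-in for your own RBF/MLP model, and the random arrays stand in for your data:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor

# Placeholder data and a stand-in model (replace with your real samples and network).
X = np.random.rand(2000, 350)
y = np.random.rand(2000, 2)
model = MLPRegressor(hidden_layer_sizes=(50, 30), activation='logistic',
                     solver='adam', max_iter=500, random_state=0)

# 10-fold cross-validation: each fold is used once for testing, the other nine for training.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_root_mean_squared_error')

print("RMSE per fold:", -scores)
print("mean RMSE: %.3f +/- %.3f" % (-scores.mean(), scores.std()))
```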

Once you're satisfied with your validation scheme, I'd also set up an environment to systematically test values for each hyperparameter and investigate them as well.
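
As one way to set that up, a small grid search evaluated with the same cross-validation scheme could look like the sketch below; the grid values are placeholders rather than recommendations, and MLPRegressor again only stands in for your own model:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPRegressor

# Placeholder data (replace with your real samples).
X = np.random.rand(2000, 350)
y = np.random.rand(2000, 2)

# Hypothetical grid; pick ranges that make sense for your problem.
param_grid = {
    "hidden_layer_sizes": [(50, 30), (100, 50), (30,)],
    "alpha": [1e-4, 1e-3, 1e-2],          # L2 regularization strength
    "learning_rate_init": [1e-3, 1e-2],
}

search = GridSearchCV(
    MLPRegressor(activation='logistic', solver='adam', max_iter=500, random_state=0),
    param_grid,
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
    scoring='neg_root_mean_squared_error',
)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best CV RMSE: %.3f" % -search.best_score_)
```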