Neural Networks – Comparing Big Neural Network vs Ensemble of Small Neural Networks


Assuming the cost of training these models isn't an issue, is it advantageous to model data using an ensemble of "shallow" neural networks over a single deep neural network?

My thoughts right now are that ensembling can help reduce variance, while making a larger, deeper neural network can help reduce bias. Both approaches can reduce mean squared error, but it's difficult to determine which is more fruitful without knowing more about the situation and the nature of the shallow/deep models involved.

Does my reasoning make sense?

Best Answer

I agree with your thoughts. However, small models tend to have high bias and large models high variance (see S. Geman et al.: Neural Networks and the Bias/Variance Dilemma). So if the model is too small, ensembling may not help much, because almost all of the error is due to bias. And if the cost of training isn't an issue, ensembling large models may be best of all!
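As a rough illustration (not from the answer itself), here is a minimal sketch using scikit-learn's MLPRegressor on a toy regression problem; the dataset, layer sizes, and ensemble size are all arbitrary illustrative choices, and real results will depend heavily on the data:

```python
# Compare: one large net vs. an ensemble of small nets vs. an ensemble of large nets.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def ensemble_mse(hidden, n_members):
    """Average predictions of n_members MLPs that differ only in their
    random weight initialisation, then score the averaged prediction."""
    preds = []
    for seed in range(n_members):
        net = MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000,
                           random_state=seed)
        net.fit(X_train, y_train)
        preds.append(net.predict(X_test))
    return mean_squared_error(y_test, np.mean(preds, axis=0))

print("single large net     :", ensemble_mse((128, 128), 1))
print("ensemble of small nets:", ensemble_mse((8,), 10))
print("ensemble of large nets:", ensemble_mse((128, 128), 10))
```

If the small nets are badly underfitting, averaging them tends to leave the bias untouched, whereas averaging the large nets mainly cancels variance.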

Another consideration is how you construct the ensemble - e.g. do you vary the weight initialisations, train on different subsets of the data, or vary the model architecture? Each of these can introduce variance between the ensemble members that ensembling can then take advantage of (J. Brownlee: Ensemble Learning Methods for Deep Learning Neural Networks). If you are using small models, varying the architecture may produce the most disagreement between predictions; see the sketch below.
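A hypothetical sketch of those three sources of diversity, again with scikit-learn's MLPRegressor; the function names, layer layouts, and ensemble sizes are my own illustrations, and the inputs are assumed to be NumPy arrays:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def init_diverse(X, y, n_members=5):
    # Same data, same architecture, different random weight initialisations.
    return [MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                         random_state=seed).fit(X, y)
            for seed in range(n_members)]

def bagging_diverse(X, y, n_members=5):
    # Each member is trained on a bootstrap resample of the training data.
    members = []
    for seed in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))
        members.append(MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                                     random_state=seed).fit(X[idx], y[idx]))
    return members

def architecture_diverse(X, y):
    # Members differ in depth/width; with small models this often produces
    # the largest disagreement between predictions.
    layouts = [(4,), (16,), (8, 8), (32,), (16, 16)]
    return [MLPRegressor(hidden_layer_sizes=h, max_iter=2000,
                         random_state=0).fit(X, y) for h in layouts]

def ensemble_predict(members, X):
    # Simple unweighted average of the members' predictions.
    return np.mean([m.predict(X) for m in members], axis=0)
```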

Also, if you have access to it, you may find the discussion section of N. Ueda & R. Nakano: Generalization error of ensemble estimators of interest. Instead of ensembling models trained on subsets of the data, the authors ask "Why don't you train a single estimator by using all N training samples at once?", and show that, at least in some cases, it is better to train a single model on all the data than to train several models on subsets of it.
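You can run that comparison yourself with a quick sketch like the one below (my own toy setup, not the paper's experiment): average k models trained on disjoint subsets versus one identical model trained on all N samples.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=3000, noise=1.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Ensemble: k models, each trained on a disjoint 1/k slice of the data.
k = 5
preds = []
for i, (Xs, ys) in enumerate(zip(np.array_split(X_tr, k), np.array_split(y_tr, k))):
    net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=i)
    preds.append(net.fit(Xs, ys).predict(X_te))
print("ensemble on 1/k subsets:", mean_squared_error(y_te, np.mean(preds, axis=0)))

# Single model of the same architecture trained on all N samples.
full = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
print("single model on all N  :", mean_squared_error(y_te, full.fit(X_tr, y_tr).predict(X_te)))
```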