Solved – Bayesian neural networks: very multimodal posterior

bayesian, identifiability, machine-learning, neural-networks, posterior

Question:

How do Bayesian treatments of neural networks address the fact that the posterior has an exponentially large number of modes?

Background:

There seems to be a lot of interest in Bayesian treatments of neural networks, where we attempt to model the posterior distribution over network weights given the data, using, e.g., the Laplace approximation, Monte Carlo methods, or variational inference. In principle, this would allow you to integrate over model parameters to avoid overfitting and to provide well-calibrated uncertainty estimates for predictions.
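In symbols, the object being approximated is the posterior predictive distribution

$$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, \mathbf{w}) \, p(\mathbf{w} \mid \mathcal{D}) \, d\mathbf{w},$$

which Monte Carlo methods estimate as $\frac{1}{S} \sum_{s=1}^{S} p(y^* \mid x^*, \mathbf{w}^{(s)})$ with samples $\mathbf{w}^{(s)} \sim p(\mathbf{w} \mid \mathcal{D})$.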

For multilayer perceptrons, the posterior has an exponentially large number of symmetric modes because the parameters are not identifiable. (As pointed out in Kevin Murphy's book "Machine Learning: A Probabilistic Perspective", Chapter 16.5.5, we can permute the identities of the hidden units without affecting the likelihood, giving $H!$ equivalent settings of the parameters, where $H$ is the number of hidden units. If the neural net uses an odd activation function like $\tanh$ ($\tanh(-x)=-\tanh(x)$), there are also $2^H$ sign-flip degeneracies: we can flip the sign of a hidden unit's incoming weights and bias as long as we also flip the sign of all its outgoing weights. Combined, these give $H! \cdot 2^H$ equivalent parameter settings.)
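Both symmetries are easy to verify numerically. Here is a minimal NumPy sketch (the layer sizes are arbitrary, chosen just for the demo) for a one-hidden-layer tanh network:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer tanh network: f(x) = W2 @ tanh(W1 @ x + b1) + b2
# (hypothetical sizes, chosen only for the demo)
H, D = 4, 3
W1, b1 = rng.normal(size=(H, D)), rng.normal(size=H)
W2, b2 = rng.normal(size=(1, H)), rng.normal(size=1)

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

x = rng.normal(size=D)
ref = mlp(x, W1, b1, W2, b2)

# Permutation symmetry: relabel the hidden units consistently.
perm = rng.permutation(H)
print(np.allclose(ref, mlp(x, W1[perm], b1[perm], W2[:, perm], b2)))  # True

# Sign-flip symmetry: negate a unit's incoming weights and bias as well
# as its outgoing weights; tanh(-z) = -tanh(z) makes the flips cancel.
s = rng.choice([-1.0, 1.0], size=H)
print(np.allclose(ref, mlp(x, s[:, None] * W1, s * b1, W2 * s, b2)))  # True
```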

So even for a tiny feedforward net with $H=15$, the posterior has $15! \cdot 2^{15} \approx 4 \times 10^{16}$ symmetric modes (the permutations alone contribute $15! > 10^{12}$). This sounds like it could be a big problem for Monte Carlo approximations, for example, since there is no way you could draw even one sample from each mode. On the other hand, since the modes introduced by parameter unidentifiability are all equivalent, perhaps you're fine as long as you model at least one of them well…
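For the record, the count is quick to check:

```python
import math

H = 15
modes = math.factorial(H) * 2**H  # H! permutations times 2^H sign flips
print(f"{modes:.1e}")  # 4.3e+16
```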

Is this actually a problem? If so, how can it be addressed?

Best Answer

Regarding the question of how the non-identifiability can be addressed, I recommend having a look at Improving the Identifiability of Neural Networks for Bayesian Inference, which "eliminates" the (discrete) combinatorial non-identifiability through an ordering of the nodes (as one of the comments suspected). The paper also addresses a continuous non-identifiability problem (related to the rescaling invariance of ReLUs) and tries to solve that, too. Very similar problems are encountered in Bayesian mixture models and can be "solved" there as well; cf. the excellent tutorial Identifying Bayesian Mixture Models.
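For intuition, here is a minimal sketch of the node-ordering idea in NumPy for a one-hidden-layer tanh network; the particular sign and sorting conventions are my own assumptions, not the construction from the paper:

```python
import numpy as np

def canonicalize(W1, b1, W2):
    """Map (W1, b1, W2) to one representative of its symmetry class
    for a tanh network. The conventions below are illustrative, not
    those used in the paper."""
    # 1. Undo the 2^H sign flips: force each unit's bias to be non-negative.
    s = np.where(b1 < 0, -1.0, 1.0)
    W1, b1, W2 = s[:, None] * W1, s * b1, W2 * s
    # 2. Undo the H! permutations: sort the hidden units by bias.
    order = np.argsort(b1)
    return W1[order], b1[order], W2[:, order]
```

After mapping every posterior sample through such a canonicalization, the $H! \cdot 2^H$ symmetric copies of a mode collapse onto one, so chains can be compared and posterior summaries become meaningful.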

Unfortunately, it appears that even after one accounts for the above, the risk of multiple modes remains, as discussed in Why are Bayesian Neural Networks multi-modal?

I also recommend reading Section 3.7 of the paper "Issues in Bayesian Analysis of Neural Network Models", which discusses mechanisms leading to multimodal behaviour. Besides the ones already mentioned, the authors also discuss a problem they refer to as "node-duplication"; a sketch of that mechanism follows.
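The idea: a network in which a hidden unit is duplicated and its outgoing weight is split across the two copies computes exactly the same function, so distinct regions of weight space (corresponding to different effective architectures) can carry posterior mass. A minimal NumPy check (hypothetical sizes, my own illustration rather than the paper's example):

```python
import numpy as np

rng = np.random.default_rng(1)
H, D = 3, 2  # hypothetical sizes for the demo
W1, b1 = rng.normal(size=(H, D)), rng.normal(size=H)
W2, b2 = rng.normal(size=(1, H)), rng.normal(size=1)

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

# Duplicate hidden unit 0 and split its outgoing weight across the two
# copies: together they contribute exactly what the original unit did.
W1_dup = np.vstack([W1, W1[:1]])
b1_dup = np.append(b1, b1[0])
W2_dup = np.hstack([W2, W2[:, :1]])
W2_dup[0, 0] *= 0.5
W2_dup[0, -1] *= 0.5

x = rng.normal(size=D)
print(np.allclose(mlp(x, W1, b1, W2, b2),
                  mlp(x, W1_dup, b1_dup, W2_dup, b2)))  # True
```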
