Bayesian Overfitting – An In-depth Look at Bayesian Thinking About Overfitting

bayesian, cross-validation, predictive-models, regression-strategies, validation

I've devoted much time to the development of methods and software for validating predictive models in the traditional frequentist statistical domain. As I put more Bayesian ideas into practice and teaching, I see some key differences to embrace. First, Bayesian predictive modeling asks the analyst to think hard about prior distributions that may be customized to the candidate features, and these priors will pull the model towards them (i.e., achieve shrinkage/penalization/regularization, with different amounts of penalization for different predictive features). Second, the "real" Bayesian way does not result in a single model; instead one gets an entire posterior distribution for a prediction.

With those Bayesian features in mind, what does overfitting mean? Should we assess it? If so, how? How do we know when a Bayesian model is reliable for field use? Or is that a moot point since the posterior will carry along all of the caution-giving uncertainties when we use the model we developed for prediction?

How would the thinking change if we forced the Bayesian model to be distilled to a single number, e.g., posterior mean/mode/median risk?

I see some related thinking here. A parallel discussion may be found here.

Follow-up question: If we are fully Bayesian, spend some time thinking about the priors before seeing the data, and fit a model whose data likelihood is appropriately specified, are we compelled to be satisfied with our model with regard to overfitting? Or do we need to do what we do in the frequentist world, where a randomly chosen subject may be predicted well on average, but a subject with a very low or very high predicted value will show regression to the mean?

Best Answer

I might start by saying that a Bayesian model cannot systematically overfit (or underfit) data that are drawn from the prior predictive distribution, which is the basis for a procedure to validate that Bayesian software is working correctly before it is applied to data collected from the world.
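As a minimal sketch of that validation idea (my own illustration, not from the original answer; the conjugate normal-normal model and all names are assumptions): repeatedly draw a parameter from the prior, simulate data from the likelihood given that draw, compute the posterior, and check that nominal posterior intervals cover the drawn parameter at the advertised rate.

```r
# Sketch: calibration check against the prior predictive distribution
# using a conjugate normal-normal model (known data SD), base R only.
set.seed(1)
n_sims  <- 2000   # number of simulated datasets
n_obs   <- 25     # observations per dataset
mu0     <- 0      # prior mean for mu
tau0    <- 2      # prior SD for mu
sigma   <- 1      # known data SD
covered <- logical(n_sims)

for (s in seq_len(n_sims)) {
  mu_true <- rnorm(1, mu0, tau0)            # draw the parameter from the prior
  y       <- rnorm(n_obs, mu_true, sigma)   # draw data from the likelihood
  # Conjugate posterior for mu given y
  post_prec <- 1 / tau0^2 + n_obs / sigma^2
  post_mean <- (mu0 / tau0^2 + sum(y) / sigma^2) / post_prec
  post_sd   <- sqrt(1 / post_prec)
  ci <- qnorm(c(0.25, 0.75), post_mean, post_sd)  # central 50% posterior interval
  covered[s] <- (mu_true > ci[1]) && (mu_true < ci[2])
}

mean(covered)  # should be close to 0.50 if the model/software is implemented correctly
```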

But it can overfit a single dataset drawn from the prior predictive distribution, or a single dataset collected from the world, in the sense that predictive measures applied to the data you conditioned on look better than the same measures applied to future data generated by the same process. Chapter 6 of Richard McElreath's book Statistical Rethinking is devoted to overfitting.

The severity and frequency of overfitting can be lessened by good priors, particularly those that are informative about the scale of an effect. By putting vanishing prior probability on implausibly large values, you discourage the posterior distribution from getting overly excited by some idiosyncratic aspect of the data that you condition on that may suggest an implausibly large effect.
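As a hedged sketch of what a prior that is informative about the scale of an effect might look like in code (assuming the rstanarm package and a hypothetical data frame `d` with outcome `y` and standardized predictors `x1`, `x2`):

```r
library(rstanarm)

# With standardized predictors, a normal(0, 1) prior on the coefficients puts
# vanishing prior probability on implausibly large effects, so the posterior is
# harder to pull toward idiosyncratic features of this particular dataset.
fit <- stan_glm(
  y ~ x1 + x2,
  data            = d,
  family          = gaussian(),
  prior           = normal(location = 0, scale = 1),
  prior_intercept = normal(0, 5),
  seed            = 123
)
```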

The best ways of detecting overfitting involve leave-one-out cross-validation, which can be approximated from a posterior distribution that does not actually leave any observations out of the conditioning set. There is an assumption that no individual "observation" [*] that you condition on has an overly large effect on the posterior distribution, but that assumption is checkable: evaluate the estimated shape parameter of a generalized Pareto distribution fit to the importance-sampling weights, which are derived from the log-likelihood of each observation evaluated over every draw from the posterior distribution. If this assumption is satisfied, then you can obtain predictive measures for each observation as if that observation had been omitted, the posterior had been drawn conditional on the remaining observations, and the posterior predictive distribution had been constructed for the omitted observation. If your predictions of left-out observations suffer, your model was overfitting to begin with. These ideas are implemented in the loo package for R, which includes citations such as here and there.
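Continuing the hypothetical rstanarm fit above, the loo workflow might look roughly like this (the 0.7 threshold is the rule of thumb from the loo documentation):

```r
library(loo)

loo_fit <- loo(fit)   # PSIS approximation to leave-one-out cross-validation
print(loo_fit)        # elpd_loo, p_loo, and a summary of the Pareto k diagnostics

# Estimated generalized Pareto shape parameter (k) for each observation; large
# values flag observations whose omission the importance-sampling approximation
# cannot handle reliably.
k <- pareto_k_values(loo_fit)
which(k > 0.7)
```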

As far as distilling to a single number goes, I like to calculate the proportion of observations that fall within their 50% predictive intervals. To the extent that this proportion is greater than one half, the model is overfitting, although you need more than a handful of observations to cut through the noise in the inclusion indicator. For comparing different models (which may all overfit), the expected log predictive density (calculated by the loo function in the loo package) is a good measure (proposed by I. J. Good) because it takes into account the possibility that a more flexible model may fit the available data better than a less flexible model yet be expected to predict future data worse. These ideas can also be applied to the expectation of any predictive measure that may be more intuitive to practitioners; see the E_loo function in the loo package.
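A sketch of the 50% predictive-interval check and of an elpd_loo comparison, again assuming rstanarm/loo and the hypothetical fit and data frame above:

```r
# Proportion of observations falling inside their 50% posterior predictive
# intervals, computed on the data the model was conditioned on; a proportion
# noticeably above 0.50 suggests overfitting.
pi50 <- predictive_interval(fit, prob = 0.5)   # one row per observation
mean(d$y >= pi50[, 1] & d$y <= pi50[, 2])

# Comparing a more and a less flexible model by expected log predictive density
fit2 <- stan_glm(y ~ x1, data = d, family = gaussian(),
                 prior = normal(0, 1), seed = 123)
loo_compare(loo(fit), loo(fit2))   # higher elpd_loo means better expected prediction
```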

[*] You do have to choose what constitutes an observation in a hierarchical model. For example, are you interested in predicting a new patient or a new time point for an existing patient? You can do it either way, but the former requires that you (re)write the likelihood function to integrate out the patient-specific parameters.
