Solved – Best approach for model selection: Bayesian or cross-validation?

bayesian, cross-validation, feature selection, model selection

When trying to select among various models, or to decide how many features to include for, say, prediction, I can think of two approaches.

  1. Split the data into training and test sets. Better still, use bootstrapping or k-fold cross-validation. Train on the training set each time and calculate the error over the test set. Plot test error vs. number of parameters. Usually, you get something like this (a code sketch of this procedure follows the list): [plot: test error vs. number of parameters]
  2. Compute the marginal likelihood of the model by integrating over the values of the parameters, i.e., compute $\int_\theta P(D\mid\theta)\,P(\theta)\,d\theta$, and plot this against the number of parameters. We then get something like this: [plot: marginal likelihood vs. number of parameters]
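A minimal sketch of approach 1, assuming scikit-learn and a synthetic regression data set (the data, the feature ordering, and the 5-fold choice are all illustrative assumptions, not part of the question):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data; in practice you would use your own X, y.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

cv_errors = []
for k in range(1, X.shape[1] + 1):
    # Candidate model: the first k columns. A real analysis would rank or
    # select features more carefully; this only traces out the error curve.
    scores = cross_val_score(LinearRegression(), X[:, :k], y,
                             scoring="neg_mean_squared_error", cv=5)
    cv_errors.append(-scores.mean())

best_k = int(np.argmin(cv_errors)) + 1
print("Number of features minimising 5-fold CV error:", best_k)
```

Plotting `cv_errors` against the number of features gives the kind of curve described in item 1.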

So my questions are:

  1. Are these approaches suitable for solving this problem (deciding how many parameters to include in your model, or selecting among a number of models)?
  2. Are they equivalent? Probably not. Will they give the same optimal model under certain assumptions or in practice?
  3. Other than the usual philosophical difference of specifying prior knowledge in Bayesian models, etc., what are the pros and cons of each approach? Which one would you choose?

Update:
I also found the related question on comparing AIC and BIC. It seems that my method 1 is asymptotically equivalent to AIC and method 2 is asymptotically related to BIC. But I also read there that BIC is equivalent to leave-one-out CV. That would mean that the test-error minimum and the Bayesian marginal-likelihood maximum are equivalent, to the extent that LOO CV is equivalent to K-fold CV. A perhaps very interesting paper, "An asymptotic theory for linear model selection" by Jun Shao, relates to these issues.
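For reference, the standard definitions behind these correspondences (textbook results, not specific to this question) are

$$\mathrm{AIC} = 2k - 2\ln \hat L, \qquad \mathrm{BIC} = k\ln n - 2\ln \hat L,$$

and a Laplace approximation of the marginal likelihood from method 2 gives

$$\ln \int_\theta P(D\mid\theta)\,P(\theta)\,d\theta \;\approx\; \ln P(D\mid\hat\theta) - \frac{k}{2}\ln n + O(1),$$

where $k$ is the number of parameters, $n$ the sample size, and $\hat\theta$ the maximum-likelihood estimate. This is why maximising the marginal likelihood behaves asymptotically like minimising BIC.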

Best Answer

  1. Are these approaches suitable for solving this problem (deciding how many parameters to include in your model, or selecting among a number of models)?

Either one could be, yes. If you're interested in obtaining a model that predicts best, out of the list of models you consider, the splitting/cross-validation approach can do that well. If you are interested in knowing which of the models (in your list of putative models) is actually the one generating your data, then the second approach (evaluating the posterior probability of the models) is what you want.

  2. Are they equivalent? Probably not. Will they give the same optimal model under certain assumptions or in practice?

No, they are not in general equivalent. For example, using AIC (An Information Criterion, by Akaike) to choose the 'best' model corresponds to cross-validation, approximately. Use of BIC (Bayesian Information Criterion) corresponds to using the posterior probabilities, again approximately. These are not the same criterion, so one should expect them to lead to different choices, in general. They can give the same answers - whenever the model that predicts best also happens to be the truth - but in many situations the model that fits best is actually one that overfits, which leads to disagreement between the approaches.

Do they agree in practice? It depends on what your 'practice' involves. Try it both ways and find out.
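A small, self-contained sketch of "trying it both ways" on synthetic data, again assuming scikit-learn; the Gaussian-error BIC for least squares, $n\ln(\mathrm{RSS}/n) + k\ln n$, is used here as a stand-in for the posterior-probability criterion:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data; replace with your own X, y.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=1)
n = len(y)

cv_err, bic = [], []
for k in range(1, X.shape[1] + 1):
    Xk = X[:, :k]

    # Criterion 1: 5-fold cross-validated mean squared error.
    scores = cross_val_score(LinearRegression(), Xk, y,
                             scoring="neg_mean_squared_error", cv=5)
    cv_err.append(-scores.mean())

    # Criterion 2: Gaussian-error BIC on the full fit (+1 for the intercept).
    resid = y - LinearRegression().fit(Xk, y).predict(Xk)
    rss = float(resid @ resid)
    bic.append(n * np.log(rss / n) + (k + 1) * np.log(n))

print("CV picks :", int(np.argmin(cv_err)) + 1, "features")
print("BIC picks:", int(np.argmin(bic)) + 1, "features")
```

The two criteria may or may not pick the same model size on a given data set, which is exactly the point of comparing them on your own problem.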

  3. Other than the usual philosophical difference of specifying prior knowledge in Bayesian models etc., what are the pros and cons of each approach? Which one would you choose?
  • It's typically a lot easier to do the calculations for cross-validation, rather than compute posterior probabilities
  • It's often hard to make a convincing case that the 'true' model is among the list from which you are choosing. This is a problem for use of posterior probabilities, but not cross-validation
  • Both methods tend to involve use of fairly arbitrary constants; how much is an extra unit of prediction worth, in terms of numbers of variables? How much do we believe each of the models, a priori?
  • I'd probably choose cross-validation. But before committing, I'd want to know a lot about why this model selection was being done, i.e. what the chosen model was to be used for. Neither form of model selection may be appropriate if, e.g., causal inference is required.