Solved – the difference between model selection and hyperparameter tuning

hyperparameter, predictive-models, supervised-learning

In the context of supervised learning, most statistics-based texts and papers talk about model selection. For example, Hastie, Tibshirani and Friedman define it in ESL as:

  • Model Selection: estimating the performance of different models in order to choose the best one.

Machine learning literature and papers, on the other hand, usually talk about hyperparameter optimization or hyperparameter tuning. From Wikipedia:

  • In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm.

From what I can see, both are trying to achieve the same objective (choosing the best model among a set of predefined models) and can use similar techniques (such as cross-validation).

But at the same time, model selection seems to employ certain concepts that never come up in hyperparameter optimization, such as using the AIC or the BIC to select the best model.

Is this just a difference in terminology or is there a conceptual difference between the two that I am missing?

Best Answer

The way I look at it (others may disagree!) is that it's all the same problem, but the effects of some hyperparameters are easier to judge and optimize than others, and you aren't always able to give an acceptable quantification of every aspect under consideration.

For instance, you could fit a ridge-penalized logistic regression and jointly optimize the link function, which features are included, and the ridge penalty by a search over $$ \{\text{probit},\text{logit}\} \times \{0,1\}^p \times [0,\infty) $$ to minimize the negative log-likelihood. But in a typical statistics situation this will be a really high-variance optimization (the feature-inclusion part is about as discrete as it gets, so good luck doing this well for a large number of features), it will probably hurt your generalization, and you'll probably want to make those decisions on scientific grounds anyway. So it's not that you couldn't treat all of these as one big hyperparameter and optimize them; it's more that that just isn't a helpful way to look at it. Instead you'd pick a sensible link, include all the features that you think make scientific sense, and then tune only the ridge penalty (if you even still want to do a ridge regression).
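To make the "tune only the ridge penalty" route concrete, here is a minimal sketch in Python with scikit-learn (not from the answer itself; the data and settings are made up for illustration). Note that scikit-learn only offers the logit link and parameterizes the penalty as $C = 1/\lambda$, which is part of the point: in practice you fix the link and the feature set and cross-validate over $\lambda$ alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for "all the features you think make scientific sense".
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# The link (logit) and the feature set are fixed by choice; only the ridge
# penalty is tuned, by 5-fold cross-validation on the log-likelihood.
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=np.logspace(-3, 3, 13), cv=5, penalty="l2",
                         scoring="neg_log_loss", max_iter=5000),
)
model.fit(X, y)

# scikit-learn reports the best C = 1/lambda for each class.
best_C = model.named_steps["logisticregressioncv"].C_[0]
print(f"selected ridge penalty lambda = {1.0 / best_C:.4g}")
```

The contrast with the joint search above is that the $\{0,1\}^p$ feature-inclusion part has $2^p$ cells per link, which is exactly what makes that version so high variance.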

Or maybe you have 5 different models and you evaluate them on AIC/BIC. This is like a one-dimensional grid search with each cell being a model, so it's not actually any different. But probably you're not just thinking about the *IC values, and there are other concerns not represented by that one number, so you wouldn't actually do this as a pure optimization: your objective function fails to capture every aspect of the problem. Other parameters, like $\lambda$ in a ridge regression, don't carry as much interpretive or scientific weight, so there's no problem with just optimizing them, and it's a feasible thing to do too.
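As a concrete (invented) illustration of the "one-dimensional grid with each cell being a model" view, here is a short sketch comparing a few candidate linear models by AIC and BIC with statsmodels; the data and the candidate list are made up for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=1.0, size=n)

# Five candidate models = five cells of a one-dimensional "grid".
candidates = {
    "intercept only": np.ones((n, 1)),
    "x1":             sm.add_constant(np.column_stack([x1])),
    "x1 + x2":        sm.add_constant(np.column_stack([x1, x2])),
    "x1 + x2 + x3":   sm.add_constant(np.column_stack([x1, x2, x3])),
    "x1 + x1^2":      sm.add_constant(np.column_stack([x1, x1 ** 2])),
}

for name, X in candidates.items():
    fit = sm.OLS(y, X).fit()
    print(f"{name:15s}  AIC = {fit.aic:8.2f}  BIC = {fit.bic:8.2f}")
```

Picking the row with the smallest AIC or BIC would be exactly a grid search with *IC as the objective; the point above is that in practice you rarely let that single number make the decision on its own.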

And speaking of *IC, you can definitely use AIC and BIC for more machine-learning-style models. They both have asymptotic relationships to cross-validation, so it's all getting at the same idea. Just as an example, I found the 2012 paper "AIC and BIC based approaches for SVM parameter value estimation with RBF kernels" by Demyanov et al., so there are definitely people in machine learning thinking about these things.
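For reference (standard background rather than anything specific to that paper), the two criteria are

$$ \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}, $$

where $\hat{L}$ is the maximized likelihood, $k$ the number of estimated parameters, and $n$ the sample size. The asymptotic relationship mentioned above is, for AIC, an equivalence with leave-one-out cross-validation under the usual regularity conditions.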

So that's my opinion, at least: there aren't any fundamental differences, but in practice there are a lot of modeling decisions that we're not just going to cross-validate over, so it's nice to have other tools for them. Sometimes those tools are cheap criteria like *IC (they don't require fitting the model on multiple subsets, so they're pretty convenient if you're not staking everything on them), other times graphical assessments of a model or scientific concerns, and other times we can reduce the decision to a numerical optimization.