Solved – the difference between model selection and hyperparameter tuning

hyperparameter, predictive-models, supervised-learning

In the context of supervised learning, most statistics-based texts and papers talk about model selection. For example, Hastie, Tibshirani and Friedman define it in ESL as:

  • Model Selection: estimating the performance of different models in order to choose the best one.

Machine learning literature and papers, on the other hand, usually talk about hyperparameter optimization or hyperparameter tuning. From Wikipedia:

  • In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm.

From what I can see, both are trying to achieve the same objective (choosing the best model among a set of predefined models) and can use similar techniques (such as cross-validation).

But at the same time, model selection seems to employ certain concepts that never come up in hyperparameter optimization, such as using the AIC or the BIC to select the best model.

Is this just a difference in terminology or is there a conceptual difference between the two that I am missing?

Best Answer

The way I look at it (others may disagree!) is that it's all the same problem, but the effects of some hyperparameters are easier to judge and optimize than others, and you aren't always able to give an acceptable quantification of every aspect under consideration.

For instance, you could fit a ridge-penalized logistic regression and jointly optimize the link function, which features are included, and the ridge penalty by a search over $$ \{\text{probit},\text{logit}\} \times \{0,1\}^p \times [0,\infty) $$ to minimize the negative log-likelihood. But in a typical statistics situation this will be a really high-variance optimization (the feature-inclusion part is about as discrete as it gets, so good luck doing this well for a large number of features), it will probably hurt your generalization, and you'll probably want to make those decisions on scientific grounds anyway. So it's not that you couldn't treat all of these as one big hyperparameter and optimize them; it's more that that just isn't a helpful way to look at it. Instead you'd pick a sensible link, include all the features that you think make scientific sense, and then tune only the ridge penalty (if you even still want to do a ridge regression).
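To make the "tune only the ridge penalty" route concrete, here is a minimal sketch in Python with scikit-learn (not from the answer itself; the data and settings are made up for illustration). Note that scikit-learn only offers the logit link and parameterizes the penalty as $C = 1/\lambda$, which is part of the point: in practice you fix the link and the feature set and cross-validate over $\lambda$ alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for "all the features you think make scientific sense".
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# The link (logit) and the feature set are fixed by choice; only the ridge
# penalty is tuned, by 5-fold cross-validation on the log-likelihood.
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=np.logspace(-3, 3, 13), cv=5, penalty="l2",
                         scoring="neg_log_loss", max_iter=5000),
)
model.fit(X, y)

# scikit-learn reports the best C = 1/lambda for each class.
best_C = model.named_steps["logisticregressioncv"].C_[0]
print(f"selected ridge penalty lambda = {1.0 / best_C:.4g}")
```

The contrast with the joint search above is that the $\{0,1\}^p$ feature-inclusion part has $2^p$ cells per link, which is exactly what makes that version so high variance.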

Or maybe you have 5 different models and you evaluate them on AIC/BIC. This is like a one-dimensional grid search with each cell being a model, so it's not actually any different. But probably you're not just thinking about the *IC values, and there are other concerns not represented by that one number, so you wouldn't actually do this as a pure optimization: your objective function fails to capture every aspect of the problem. Other parameters, like $\lambda$ in a ridge regression, don't carry as much interpretive or scientific weight, so there's no problem with just optimizing them, and it's a feasible thing to do too.
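As a concrete (invented) illustration of the "one-dimensional grid with each cell being a model" view, here is a short sketch comparing a few candidate linear models by AIC and BIC with statsmodels; the data and the candidate list are made up for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=1.0, size=n)

# Five candidate models = five cells of a one-dimensional "grid".
candidates = {
    "intercept only": np.ones((n, 1)),
    "x1":             sm.add_constant(np.column_stack([x1])),
    "x1 + x2":        sm.add_constant(np.column_stack([x1, x2])),
    "x1 + x2 + x3":   sm.add_constant(np.column_stack([x1, x2, x3])),
    "x1 + x1^2":      sm.add_constant(np.column_stack([x1, x1 ** 2])),
}

for name, X in candidates.items():
    fit = sm.OLS(y, X).fit()
    print(f"{name:15s}  AIC = {fit.aic:8.2f}  BIC = {fit.bic:8.2f}")
```

Picking the row with the smallest AIC or BIC would be exactly a grid search with *IC as the objective; the point above is that in practice you rarely let that single number make the decision on its own.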

And speaking of *IC, you can definitely use AIC and BIC for more machine-learning-style models. They both have asymptotic relationships to cross-validation, so it's all getting at the same idea. Just as an example, I found the 2012 paper "AIC and BIC based approaches for SVM parameter value estimation with RBF kernels" by Demyanov et al., so there are definitely people in machine learning thinking about these things.
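For reference (standard background rather than anything specific to that paper), the two criteria are

$$ \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}, $$

where $\hat{L}$ is the maximized likelihood, $k$ the number of estimated parameters, and $n$ the sample size. The asymptotic relationship mentioned above is, for AIC, an equivalence with leave-one-out cross-validation under the usual regularity conditions.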

So that's my opinion, at least: there aren't any fundamental differences, but in practice there are a lot of modeling decisions that we're not just going to cross-validate over, so it's nice to have other tools for them. Sometimes those tools are cheap criteria like *IC (they don't require fitting the model on multiple subsets, so they're pretty convenient if you're not staking everything on them), other times graphical assessments of a model or scientific concerns, and other times we can reduce the decision to a numerical optimization.