While studying hyperparameter tuning in machine learning, I have read about Bayesian optimization for hyperparameter tuning and about using a validation set when training a model, but it's ambiguous to me how the two relate. With a validation set I can measure the model's performance as I adjust the hyperparameters, but I don't know when Bayesian optimization comes into play for hyperparameter tuning. I hope to hear some explanations about it.

# Bayesian Optimization – Difference Between Bayesian Optimization for Hyperparameters and Validation Set Training

bayesian-optimization, hyperparameter, machine-learning

#### Related Solutions

The difference isn't especially enlightening. Grid-search pre-specifies some set of tuples up front and tries all of them. In manual search, a human adjusts the parameters, possibly incorporating knowledge about how those adjustments will influence the behavior of the model and estimation procedure.
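As a toy sketch of what "pre-specifies some set of tuples up front and tries all of them" means (the hyperparameter names, candidate values, and the quadratic stand-in for a real validation run below are all invented for illustration):

```python
import itertools

# Stand-in for training a model and scoring it on a validation set;
# a real objective would fit and evaluate the model at these settings.
def validation_error(lr, reg):
    return (lr - 0.1) ** 2 + (reg - 1.0) ** 2

# Grid search: pre-specify every tuple up front, then try all of them.
learning_rates = [0.01, 0.1, 1.0]
regs = [0.1, 1.0, 10.0]
grid = list(itertools.product(learning_rates, regs))

# Exhaustively evaluate each tuple and keep the best one.
best = min(grid, key=lambda hp: validation_error(*hp))
```

Manual search is the same loop with a human choosing the next tuple instead of `itertools.product` enumerating them.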

The way I look at it (others may disagree!) is that it's all the same problem, but the effects of some hyperparameters are easier to judge and optimize than others, and you aren't always able to quantify every aspect under consideration acceptably.

For instance, you could fit a ridge penalized logistic regression and jointly optimize the link function, which features are included, and the ridge penalty by a search over
$$
\{\text{probit},\text{logit}\} \times \{0,1\}^p \times [0,\infty)
$$
to minimize the negative log likelihood. But if you're in a typical statistics situation this will be a really high variance optimization (it's about as discrete as it gets, so good luck doing this well for a large number of features), it will probably really hurt your generalization, and you'll probably want to make these decisions on scientific grounds anyway. So it's not that you *couldn't* treat all of these as one big hyperparameter and optimize them; it's that this just isn't a helpful way to look at it. Instead you'd pick a sensible link, include all the features that make scientific sense, and then tune only the ridge penalty (if you even still want to do a ridge regression).
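A minimal sketch of that "tune only the ridge penalty" alternative, reducing the huge discrete product space above to a one-dimensional search. The one-feature dataset and candidate penalties are made up, and the closed-form one-dimensional ridge fit stands in for a real model:

```python
# Made-up (x, y) pairs split into a training set and a validation set.
train = [(0.0, 0.1), (1.0, 2.1), (2.0, 3.9)]
valid = [(0.5, 1.0), (1.5, 3.0)]

def fit_ridge(data, lam):
    # Closed-form one-feature ridge slope: beta = sum(x*y) / (sum(x^2) + lam)
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

def val_mse(beta, data):
    # Mean squared error of the fitted slope on held-out data.
    return sum((y - beta * x) ** 2 for x, y in data) / len(data)

# One-dimensional search over the penalty only; link and features stay fixed.
lambdas = [0.01, 0.1, 1.0, 10.0]
best_lam = min(lambdas, key=lambda lam: val_mse(fit_ridge(train, lam), valid))
```

On this toy data the validation error simply picks out one of the candidate penalties; the point is only that the search space is now a single tractable axis.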

Or maybe you have 5 different models and you evaluate them on AIC/BIC. This is like a one-dimensional grid search with each cell being a model, so it's not actually any different. But probably you're not *just* thinking about the *IC values, and there are other concerns not represented by that one number, so you wouldn't actually do this as an optimization, because your objective function fails to capture every aspect of the problem. Other parameters, like $\lambda$ in a ridge regression, don't carry as much interpretation or scientific baggage, so it's no problem to just optimize them, and it's feasible to do so too.
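A sketch of that one-dimensional "grid" over models, using the standard AIC/BIC formulas; the log-likelihoods and parameter counts below are invented for illustration:

```python
import math

def aic(loglik, k):
    # AIC = 2k - 2*log-likelihood
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    # BIC = k*ln(n) - 2*log-likelihood
    return k * math.log(n) - 2 * loglik

# (log-likelihood, number of parameters) for five hypothetical models.
models = {"m1": (-120.0, 2), "m2": (-115.0, 4), "m3": (-110.0, 8),
          "m4": (-109.0, 12), "m5": (-108.5, 20)}
n = 100  # hypothetical sample size

aics = {name: aic(ll, k) for name, (ll, k) in models.items()}
bics = {name: bic(ll, k, n) for name, (ll, k) in models.items()}
best_aic = min(aics, key=aics.get)
best_bic = min(bics, key=bics.get)
```

With these made-up numbers, BIC's heavier complexity penalty ($k\ln n$ vs $2k$) prefers a smaller model than AIC does, which is exactly the kind of judgment the single number can't settle on its own.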

And speaking of *IC, you can definitely use AIC and BIC for more machine learning-style models. They both have asymptotic relationships to cross validation, so it's all getting at the same idea. Just as an example, I found the paper "AIC and BIC based approaches for SVM parameter value estimation with RBF kernels" (Demyanov et al., 2012), so there are definitely people in machine learning thinking about these things.

So that's my opinion, at least: there aren't any fundamental differences but in practice there are a lot of modeling decisions that we're not just going to cross validate over so it's nice to have other tools for them. Sometimes it's easy criteria like *IC (these don't require fitting a model on multiple subsets so they are pretty convenient if you're not basing your life on them), other times graphical assessments of a model or scientific concerns, and other times we can reduce it to a numerical optimization.

## Best Answer

Bayesian optimization (BO) is a recipe for deciding how to explore the hyperparameter (HP) space. You still use your validation set as before, but you explore the space in the direction BO suggests. This is typically useful when you have many HPs to tune and it's computationally expensive (or impossible) to try them all. Grid search explores the HP space in an ordered manner, and random search just randomly picks HP values, but BO selects the next HPs to try based on how much information it would gain from trying those HPs with your model. That makes it much more economical for a large, high-dimensional HP space.
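A toy sketch of that loop, tuning a single penalty on $[0, 1]$. Everything here is invented: the objective stands in for a real train-and-validate run, and a crude nearest-neighbor surrogate replaces the Gaussian process a real BO library would fit. It only illustrates the core idea of evaluating next where the surrogate says a trial would be most promising or most informative:

```python
# Stand-in for training a model at penalty lam and measuring validation error.
def validation_error(lam):
    return (lam - 0.3) ** 2

# Start with a couple of evaluated points, as BO typically does.
observed = [(0.0, validation_error(0.0)), (1.0, validation_error(1.0))]

def acquisition(x):
    # Crude surrogate: predict the value at x from the nearest evaluated
    # point, and treat the distance to it as uncertainty. The lower-
    # confidence-bound-style score (we are minimizing) favors candidates
    # that are either predicted good or far from anything tried yet.
    nearest_x, nearest_y = min(observed, key=lambda p: abs(p[0] - x))
    uncertainty = abs(x - nearest_x)
    return nearest_y - uncertainty

candidates = [i / 100 for i in range(101)]
for _ in range(10):
    # Pick the candidate the surrogate finds most promising, evaluate the
    # expensive objective there, and fold the result back into the model.
    x = min(candidates, key=acquisition)
    observed.append((x, validation_error(x)))

best_lam, best_err = min(observed, key=lambda p: p[1])
```

In a handful of evaluations the loop concentrates near the minimum instead of sweeping the whole range, which is the economy BO buys you when each evaluation means training a model.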