I think you are describing nested cross-validation, and you can use it to select your best hyperparameters. R already has packages implementing this; for example, for support vector machines you could use the e1071 package and do something like the following, assuming you have two independent variables:
library(e1071)  # provides tune.svm and tune.control
svmTuning <- tune.svm(Y ~ X1 + X2, type = "nu-regression", kernel = "radial", data = dat,
                      gamma = seq(from = 0.1, to = 3, by = 0.1), cost = seq(from = 2, to = 16, by = 2),  # gamma must be positive for the radial kernel
                      tunecontrol = tune.control(sampling = "cross", cross = 1000))
If you had 1000 observations, the previous call would perform leave-one-out cross-validation, sweeping through all combinations of the selected gammas and costs (but only one kernel in this case). You can see the best parameters by doing:
svmTuning$best.parameters
The best combination is chosen according to the error estimated by the cross-validation you specified: mean squared error in the regression case, average classification error in the classification case.
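If you want to check how each combination actually did, the tune object returned by e1071 keeps the full table of cross-validated results; a minimal sketch of the usual accessors:

# Cross-validated error (MSE here) and its dispersion for every gamma/cost combination tried
svmTuning$performances
# Lowest cross-validated error found during the search
svmTuning$best.performance
# Model refitted on the full data with the best parameters
svmTuning$best.model
# Rough visual check of the error surface over gamma and cost
plot(svmTuning)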
Here's another example with kernel k-nearest neighbours, using the kknn package:
library(kknn)  # provides train.kknn
knnTuning <- train.kknn(Y ~ X1 + X2, data = dat, kmax = 40, distance = 2,
                        kernel = c("rectangular", "triangular", "epanechnikov", "gaussian", "rank", "optimal"),
                        ykernel = NULL, scale = TRUE)  # train.kknn performs leave-one-out cross-validation internally
This sweeps through all numbers of neighbours up to 40 and the different kernels, but uses only the Euclidean distance (distance = 2). You may plot all these results and again obtain the best parameters:
plot(knnTuning)
knnTuning$best.parameters
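Once the best k and kernel are identified, the fitted object can be inspected and used for prediction directly; a minimal sketch, where newdat is just a placeholder for whatever hold-out data you have:

# Cross-validated mean squared errors for every k/kernel combination (regression case)
knnTuning$MEAN.SQU
# Predictions from the model refitted with the best k and kernel (newdat is hypothetical)
predict(knnTuning, newdata = newdat)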
You could do the same for random forest:
library(randomForest)  # tune.randomForest from e1071 wraps randomForest
rfTuning <- tune.randomForest(Y ~ X1 + X2, data = dat, ntree = 1000,
                              mtry = 1:2,  # with only two predictors, mtry can only be 1 or 2; widen the range if you have more variables
                              tunecontrol = tune.control(sampling = "cross", cross = 1000))
Here you just sweep through possible values for the number of variables considered as candidates at each split (mtry). This kind of search is known to overfit if not done carefully.
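As in the SVM example, the winning mtry value and the forest refitted with it can be pulled out of the tune object:

rfTuning$best.parameters
rfTuning$best.model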
And so on and so forth. Since you appear to have a small sample size, leave-one-out may be the way to go. You could also look into the caret package, which has good capabilities for model building, and its documentation is very solid (theoretical descriptions and all).
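If you try caret, the same leave-one-out search for a random forest takes only a few lines; a minimal sketch, where the mtry grid is a placeholder to adjust to your own predictors:

library(caret)
ctrl <- trainControl(method = "LOOCV")   # leave-one-out resampling
grid <- expand.grid(mtry = 1:2)          # candidate numbers of variables per split (placeholder values)
rfCaret <- train(Y ~ X1 + X2, data = dat, method = "rf",
                 trControl = ctrl, tuneGrid = grid, ntree = 1000)
rfCaret$bestTune                         # mtry value with the best cross-validated error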
The most important downside of searching along single parameters instead of optimizing them all together is that you ignore interactions. It is quite common that, e.g., more than one parameter influences model complexity. In that case, you need to look at the interactions in order to successfully optimize the hyperparameters.
Depending on how large your data set is and how many models you compare, optimization strategies that return the maximum observed performance run into trouble (this is true for both grid search and your strategy). The reason is that searching through a large number of performance estimates for the maximum "skims" the variance of the performance estimate: you may just end up with a model and train/test split combination that happens to look good by accident. Even worse, you may get several perfect-looking combinations, and the optimization then cannot know which model to choose and thus becomes unstable.
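A tiny simulation illustrates this "skimming" effect: even if many candidate configurations have exactly the same true performance, the maximum of their noisy cross-validation estimates looks systematically better than it really is. The numbers below are arbitrary, purely for illustration:

set.seed(1)
true.acc <- 0.70   # every configuration has the same true accuracy
n.config <- 100    # number of hyperparameter configurations compared
cv.size  <- 50     # number of cases each estimate is based on
est.acc  <- rbinom(n.config, size = cv.size, prob = true.acc) / cv.size
max(est.acc)       # apparent accuracy of the "winner", typically well above 0.70
mean(est.acc)      # unbiased picture of what these configurations actually deliver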
Many dedicated optimization methods exist for hyperparameter tuning. Sequential model-based optimization (a Bayesian-inspired method) is a particularly popular research topic, for instance here. Metaheuristic approaches like genetic algorithms, particle swarm optimization and simulated annealing are also common; see for instance here.
If you want to model the effect of hyperparameters, random search is a good sampling strategy to start from.
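In R, random search is available through caret by switching the search strategy in trainControl; a minimal sketch assuming an SVM with a radial kernel (method "svmRadial" requires the kernlab package, and tuneLength controls how many random parameter sets get evaluated):

library(caret)
ctrl <- trainControl(method = "cv", number = 10, search = "random")  # 10-fold CV with randomly sampled parameter sets
svmRandom <- train(Y ~ X1 + X2, data = dat, method = "svmRadial",
                   trControl = ctrl, tuneLength = 30)
svmRandom$bestTune   # best randomly sampled combination of sigma and cost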
You can find implementations of such optimization methods in tuning libraries like Optunity, HyperOpt and Spearmint.