Hyperparameter Tuning – Is Hyperparameter Tuning on Sample of Dataset a Bad Idea?

cross-validation, hyperparameter, machine learning

I have a dataset of 140,000 examples and 30 features, on which I am training several classifiers for binary classification (SVM, Logistic Regression, Random Forest, etc.).

In many cases, hyperparameter tuning on the whole dataset using either grid or random search is too costly time-wise.

I started using the following technique:

  • Subsample my dataset
  • Tune the hyperparameters on the obtained fraction
  • Train a model on the whole dataset using the parameters found (a code sketch of this workflow follows this list)
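Here is a minimal sketch of that workflow, assuming scikit-learn; the synthetic data, the 20,000-example subsample size, and the SVC parameter grid are illustrative placeholders, not the actual values used in this question.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the real data (140,000 examples, 30 features).
X, y = make_classification(n_samples=140_000, n_features=30, random_state=0)

# Step 1: draw a stratified subsample so the class balance is preserved.
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=20_000, stratify=y, random_state=0)

# Step 2: tune hyperparameters on the subsample only.
param_grid = {"kernel": ["rbf"], "C": [1, 3, 9], "gamma": [0.001, 0.1, 0.5]}
search = GridSearchCV(SVC(), param_grid, cv=10, n_jobs=-1)
search.fit(X_sub, y_sub)

# Step 3: train a model on the whole dataset with the parameters found above.
final_model = SVC(**search.best_params_).fit(X, y)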

To evaluate each set of parameters in the second step I use sklearn's GridSearchCV with cv=10.
To evaluate the final model that I create in the third step I use sklearn's cross_val_predict. In that sense I evaluate my models leaving 10% of the data out, training on the rest, and measuring the predictive accuracy on the held-out 10%, repeated 10 times, and then averaging the scores.
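A sketch of that evaluation step, continuing from the snippet above: cross_val_predict produces one out-of-fold prediction per example over the 10 folds, and these pooled predictions are then scored in one go, which is what produces a report like the one shown further below.

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

# 10-fold out-of-fold predictions for the full dataset, using the tuned parameters.
y_pred = cross_val_predict(SVC(**search.best_params_), X, y, cv=10, n_jobs=-1)

print(classification_report(y, y_pred))
print("acc score:", accuracy_score(y, y_pred))
print("roc auc score:", roc_auc_score(y, y_pred))
print(confusion_matrix(y, y_pred))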

What worries me is that the prediction accuracy I get from training on the whole dataset is really close to the score of the best parameter set found during tuning (each tested parameter set gets a score obtained by averaging the 10-fold cross-validation results).

Most of the time the accuracy that cross_val_predict measures using all the training examples (the whole dataset) is slightly above the score returned for the best parameter set.

To illustrate this, here is the evaluation of a set of parameters (on a smaller dataset than the one described above, but the effect is the same):

Best parameters set found on development set:
{'kernel': 'rbf', 'C': 9, 'gamma': 0.1}
Scores for all sets of parameters
0.851 (+/-0.006) for {'kernel': 'rbf', 'C': 3, 'gamma': 0.5}
0.852 (+/-0.006) for {'kernel': 'rbf', 'C': 3, 'gamma': 0.1}
0.829 (+/-0.006) for {'kernel': 'rbf', 'C': 3, 'gamma': 0.001}
0.853 (+/-0.006) for {'kernel': 'rbf', 'C': 9, 'gamma': 0.1}
...

And here are the scores (from cross_val_predict) I got from training on my whole dataset using the best parameters:

precision    recall  f1-score   support

      0       0.86      0.85      0.86     15417
      1       0.86      0.87      0.87     16561

avg / total       0.86      0.86      0.86     31978

acc score: 0.863750078179
roc auc score: 0.863370490059
[[13147  2270]
 [ 2087 14474]]

As you can see, training on the whole dataset improves the results. I have also verified that a badly tuned model (e.g. using the default values or random values for C and gamma) leads to much worse prediction accuracy.

Overall, I think that tuning the hyperparameters on a subset is not ideal, but it can lead to reasonably good results without having to wait too long. For example, before using this approach I used the optunity package to tune the hyperparameters on the whole dataset. That procedure would take 3-5 days to complete and would produce results that had either really good precision or really good recall, but not both; so although for each class either the precision or the recall was really high (higher than what any of my other classifiers had achieved), the F1 measure was really low. In contrast, the latter approach takes a few hours of training and yields a better F1 measure.

My concerns are:

Am I limiting my classification accuracy? Am I failing to use all the predictive power my dataset can offer by tuning only on a subset? If such a performance penalty exists, is it bounded by some factor?

Best Answer

In addition to Jim's (+1) answer: for some classifiers, the optimal hyper-parameter values depend on the number of training examples. For instance, for a linear SVM the primal optimization problem is

$\mathrm{min} \frac12\|w\|^2 + C\sum_{i=1}^\ell \xi_i$

subject to

$y_i(x_i \cdot w + b) \geq 1 - \xi_i, \quad \mathrm{and} \quad \xi_i \geq 0 \quad \forall i$

Note that the optimisation problem is basically the sum of a data misfit term (the summation over $\xi_i$) and a regularisation term, but here the regularisation parameter is placed with the data misfit term rather than the regulariser. Obviously, the greater the number of training patterns we have, the larger the summation will be, and the smaller $C$ ought to be to maintain the same balance with the magnitude of the weights.

Some implementations of the SVM reparameterise as

$\mathrm{min} \frac12\|w\|^2 + \frac{C}{\ell}\sum_{i=1}^\ell \xi_i$

in order to compensate, but some don't. So an additional point to consider is whether the optimal hyper-parameters depend on the number of training examples or not.
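As a rough, concrete illustration of that point (not part of the original answer): if the implementation uses the first, unnormalised formulation, one heuristic when moving from the tuning subsample to the full dataset is to shrink $C$ by the ratio of the sample sizes. The numbers below are made up for illustration.

# Heuristic rescaling of C for the unnormalised formulation C * sum(xi):
# keeping the data-fit / regularisation balance roughly constant as the
# number of training examples grows means shrinking C proportionally.
n_subsample = 20_000    # examples used for tuning (assumed)
n_full = 140_000        # examples in the full training set
C_tuned = 9             # best C found on the subsample

C_rescaled = C_tuned * n_subsample / n_full
print(C_rescaled)       # ~1.29, a candidate starting value for the full-data refit

For the second, normalised formulation no such rescaling is needed, since the $\ell$ in the denominator already compensates.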

I agree with Jim that over-fitting the model selection criterion is likely to be the bigger problem, but if you have enough data even in the subsample then this may not be a substantial issue.