Solved – Cross-validation with Boosting Trees (do I need 4 sets?)

boosting, cross-validation

Normally, you have train, validation, and test sets for training, tuning (hyperparameters), and finally evaluating a machine-learning model.
If we use cross-validation, we can effectively get by with only train and test sets: the train set is used for CV (both training and tuning, across several folds), and after the best model is selected, it is retrained on the whole train set and evaluated on the test set. Am I correct here?
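The two-set workflow described above can be sketched as follows. This is a minimal illustration using scikit-learn (an assumption for concreteness; the question itself is library-agnostic): CV for tuning happens entirely inside the train portion, and the test set is touched only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; any classification dataset works the same way.
X, y = make_classification(n_samples=400, random_state=0)

# One split into train and test; no separate validation set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    cv=5,  # 5-fold CV on the train set does the tuning
)
# refit=True (the default) retrains the best model on all of X_train.
search.fit(X_train, y_train)

# Final evaluation on data that played no role in training or tuning.
test_score = search.score(X_test, y_test)
print(test_score)
```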

Now, I am a bit confused about the boosting trees. Let's take LightGBM as an example.
When training a boosting tree, you normally set some value for early_stopping_rounds and run a separate CV to determine the optimal number of trees in the model (with LightGBM, this is the lightgbm.cv method).
Does that mean I again need three datasets?

  1. Train (to find the optimal number of trees using CV, then train on the whole train set with that number of trees)
  2. Validation (tune other hyperparameters)
  3. Test (final evaluation)

Best Answer

You do not need a separate validation set; lightgbm.cv automatically splits the training set into cross-validation folds for you. You just pass the number of folds, e.g. nfold=3, to lightgbm.cv.

I think you can refer here for a proper application:

https://stackoverflow.com/questions/49774825/python-lightgbm-cross-validation-how-to-use-lightgbm-cv-for-regression

And yes, regarding your first question, your understanding of the cross-validation workflow is correct.

Have fun.