R h2o – How Does h2o.r Cross Validation Work?


For GBM and randomForests:

As I understand it, when I set nfolds to 10, the training frame is randomly split into two sets for the first cross-validation model: one with 90% of the rows and one with the remaining 10%. The model is built on the 90% (first set) and validated against the 10% (second set) to produce accuracy measures such as AUC.

This is how the first cross validation (CV) model is built.

The question is: for the second and subsequent CV models (out of 10), does h2o randomly choose another 10% of the entire training frame? If so, is that done without replacement, so that each row is used exactly once for validation and 9 times for training, or with replacement, so that any row can end up in either the validation or the training set of each CV model?

To rephrase the question:

Let's say I have a training frame of 10 rows and do 5-fold CV:

My first CV model can use rows 1-8 for training and rows 9-10 for validation.

Can the second CV model then use for validation only rows that have not yet been used for validation by any previously built CV model (here, any rows other than 9-10)?

Best Answer

H2O uses k-fold cross-validation, which partitions the dataset into k disjoint (non-overlapping) subsets. This happens whenever nfolds is set to an integer greater than 1.

The data is partitioned into k subsets only once. For the second, third, and later iterations of cross-validation, H2O does not randomly divide the data again. Instead, it re-assembles the existing partitions systematically so that each partition is used as the validation set exactly once and as part of the training set in the other k-1 iterations. In other words, validation rows are chosen without replacement across the CV models.
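
To make the "partition once, then rotate" idea concrete, here is a minimal base-R sketch using the 10-row, 5-fold example from the question. It is not H2O's internal code: the balanced random assignment, the seed, and all variable names are assumptions chosen for illustration only.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

n_rows  <- 10
n_folds <- 5

# Partition ONCE: shuffle a balanced vector of fold ids (2 rows per fold here).
# Assumption: balanced folds; an automatic scheme need not be perfectly balanced.
fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))

# Track how often each row is used for validation and for training.
times_validated <- integer(n_rows)
times_trained   <- integer(n_rows)

for (k in seq_len(n_folds)) {
  # Reuse the same partition: fold k is held out, the other folds are training data.
  validation_rows <- which(fold_id == k)
  training_rows   <- which(fold_id != k)

  times_validated[validation_rows] <- times_validated[validation_rows] + 1L
  times_trained[training_rows]     <- times_trained[training_rows] + 1L

  cat("CV model", k, "validates on rows:", validation_rows, "\n")
}
```

After the loop, every entry of times_validated is 1 and every entry of times_trained is n_folds - 1, which is exactly the "without replacement" behaviour the question asks about.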

In H2O, when nfolds is set to 10, for example, then a total of 11 models are trained -- the 10 CV models and then a final model on the full training set, which can be used to generate predictions on future/test data.

If you want to control how the data is partitioned, you can either pass your own partitioning via the fold_column argument or use the fold_assignment argument to choose the type of automatic partitioning. If you need to know which rows were used in which partition, set keep_cross_validation_fold_assignment = TRUE, and a single-column frame containing the fold ids (1-10 for 10-fold) for each row will be stored in the model output.
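
As a rough sketch of how these arguments fit together in R, the example below trains a GBM with 5-fold CV and then retrieves the CV models and the stored fold assignment. The iris dataset, the seed, and the choice of "Modulo" are assumptions made only for illustration, and the code needs a working local H2O installation to run.

```r
library(h2o)
h2o.init()  # starts (or connects to) a local H2O cluster

# Assumption: iris is just a stand-in dataset for this sketch.
train <- as.h2o(iris)

model <- h2o.gbm(
  x = setdiff(names(train), "Species"),
  y = "Species",
  training_frame = train,
  nfolds = 5,                                   # 5 CV models plus 1 final model
  fold_assignment = "Modulo",                   # type of automatic partitioning
  keep_cross_validation_models = TRUE,          # keep the individual CV models
  keep_cross_validation_fold_assignment = TRUE, # store the per-row fold ids
  seed = 1                                      # arbitrary, for reproducibility
)

# The CV models, one per fold; the final model on all rows is `model` itself.
cv_models <- h2o.cross_validation_models(model)
length(cv_models)

# Single-column frame with the fold id assigned to each training row.
folds <- h2o.cross_validation_fold_assignment(model)
h2o.table(folds)  # number of rows per fold id
```

Inspecting the table is a quick way to confirm that each row belongs to exactly one fold, and to see the fold ids as your H2O version labels them.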