Solved – Extracting H2o Cross Validation Results

cross-validation, h2o, r

I am using the H2O library in R and have a slight confusion that you learned people might be able to help with. I am not sure how to interpret the output of h2o.cross_validation_predictions(): it appears to contain all rows, not just those used as the holdout (test) set for each fold.

I have a dataset of 240 rows and 5 output classes. I have a model called deep built with the parameters:

nfolds = FOLDS,
keep_cross_validation_predictions = TRUE,
keep_cross_validation_fold_assignment = TRUE
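
For context, a minimal sketch of what such a model call might look like — assuming the training frame is called `train` with a response column `"class"`, and FOLDS = 2 (those names are hypothetical; only the cross-validation parameters come from the question):

```r
library(h2o)
h2o.init()

FOLDS <- 2

# Hypothetical deep learning model with 2-fold cross-validation,
# keeping the per-fold predictions and the fold assignment
deep <- h2o.deeplearning(
  x = setdiff(names(train), "class"),   # predictor columns
  y = "class",                          # response column
  training_frame = train,
  nfolds = FOLDS,
  keep_cross_validation_predictions = TRUE,
  keep_cross_validation_fold_assignment = TRUE
)
```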

I run the algorithm, and summary(deep) gives me, e.g.:

Cross-Validation Metrics Summary: 
                              mean          sd cv_1_valid cv_2_valid
accuracy                0.40934372 0.021433437 0.43965518 0.37903225
err                      0.5906563 0.021433437  0.5603448 0.62096775
err_count                     71.0   4.2426405       65.0       77.0
logloss                  1.9856012 0.049440455  2.0555205  1.9156818
max_per_class_error            1.0         0.0        1.0        1.0

I can see the results for fold 1 and fold 2.

I want to analyse the predictions for each fold myself. I use:

h2o.cross_validation_predictions(deep)

This gives a list of two elements, one for each fold, BUT why does each have 240 rows? Each fold should split into 120 training and 120 test rows, so I had anticipated these results to cover only the 120 test records.

[[1]]
  predict        p1         p2         p3         p4           p5
1       1 0.8608196 0.01197437 0.02884958 0.07538831 0.0229681450
2       1 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000000
3       1 0.7204473 0.08612677 0.08648538 0.09837615 0.0085644254
4       1 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000000
5       2 0.3437493 0.48853368 0.14723671 0.01996180 0.0005185645
6       2 0.2155752 0.63492769 0.13477110 0.01435626 0.0003697219

[240 rows x 6 columns] 

[[2]]
  predict          p1        p2         p3         p4           p5
1       1 0.000000000 0.0000000 0.00000000 0.00000000 0.0000000000
2       2 0.013206830 0.9334008 0.02128286 0.03146788 0.0006416512
3       1 0.000000000 0.0000000 0.00000000 0.00000000 0.0000000000
4       2 0.006375454 0.9370140 0.03062428 0.02546406 0.0005222259
5       1 0.000000000 0.0000000 0.00000000 0.00000000 0.0000000000
6       1 0.000000000 0.0000000 0.00000000 0.00000000 0.0000000000

[240 rows x 6 columns]

I can use h2o.cross_validation_fold_assignment(deep), which gives me:

  fold_assignment
1               0
2               1
3               0
4               1
5               0
6               0

I assume 0 corresponds to fold 1 and 1 to fold 2, and that this indicates which fold each row of the dataset was assigned to (a little inconsistent that one numbering starts at 0 and the other at 1).

Do I filter the results from h2o.cross_validation_predictions(deep), e.g. for [[1]] I select all the records indicated as “0” in the fold assignment?

Will this then give me just the records used to calculate the metrics for that fold?
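
As a sketch of that idea, the filtering could look like the following, assuming a 2-fold model stored in `deep` (the variable names `preds`, `fa`, and `holdout_fold1`/`holdout_fold2` are hypothetical):

```r
# Per-fold prediction frames (each has as many rows as the full training frame)
preds <- h2o.cross_validation_predictions(deep)

# Fold assignment: fold k is labelled k - 1
fa <- h2o.cross_validation_fold_assignment(deep)

# Keep only the rows that were held out in each fold
holdout_fold1 <- preds[[1]][fa$fold_assignment == 0, ]  # test rows of fold 1
holdout_fold2 <- preds[[2]][fa$fold_assignment == 1, ]  # test rows of fold 2
```

With 240 rows split across 2 folds, each filtered frame should then contain roughly 120 rows — the records actually used to compute that fold's metrics.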

I have tried the documentation and numerous searches – it is almost certainly my lack of ability – but help would be appreciated!

Best Answer

Here is the part of the documentation that should answer your question.

For your convenience these are the key pieces of that section of the user guide:

Each cv-model produces a prediction frame pertaining to its fold. It can be saved and probed from the various clients if keep_cross_validation_predictions parameter is set in the model constructor. These holdout predictions have some interesting properties.[...]

and they contain, unsurprisingly, predictions for the data held out in the fold. They also have the same number of rows as the entire input training frame with 0s filled in for all rows that are not in the hold out.
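
In other words, the per-fold frames are all aligned with the original 240-row training frame, which is why filtering on the fold assignment recovers each fold's holdout rows. If the goal is simply one out-of-fold prediction per training record, H2O can also return the combined holdout predictions directly (sketch, assuming the same `deep` model as above):

```r
# One row per training record, each predicted by the cross-validation
# model that did NOT see that record during training
holdout <- h2o.cross_validation_holdout_predictions(deep)
```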