Solved – Training threshold vs validation threshold for better prediction results

Tags: h2o, machine learning, model selection, threshold

Should I use a model's training threshold or its validation threshold to get the best prediction results (from a distributed random forest binary classifier built using h2o.ai), especially when the two values differ by orders of magnitude?

Details:

I used the h2o Flow UI to create an RDF model and examined the training and validation thresholds optimized for maximizing the minimum per-class accuracy in each set (since the classes were imbalanced), e.g., for the validation set.
Now I have a set of new data that I would like to make predictions on using the exported model. The RDF just outputs probability values for each of the two response classes, so I need to choose a threshold (in my case, the one for max_min_per_class_accuracy) to use as a cutoff separating them.
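For concreteness, applying a custom cutoff to the predicted probabilities looks roughly like this with the h2o Python API (a minimal sketch: the file paths and the 0.35 threshold are placeholders, and `p1` is the probability column h2o produces for the positive class of a binary classifier):

```python
import h2o

h2o.init()

# Load the exported model and the new data (paths are placeholders).
model = h2o.load_model("path/to/exported_model")
new_data = h2o.import_file("path/to/new_data.csv")

# predict() returns a frame with a default label column plus
# per-class probabilities (p0, p1 for a binary classifier).
preds = model.predict(new_data)

# Apply your own cutoff instead of the default one: label a row as
# the positive class whenever p1 meets or exceeds the threshold.
threshold = 0.35  # whichever threshold you settle on (placeholder value)
custom_labels = preds["p1"] >= threshold
```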

I don't know much about machine learning, but my instinct is that the validation threshold would be better, since the threshold that worked best on the training set is likely too optimistic and overfit. However, the default threshold used by the h2o Python API is the training one, which makes me question my own preference (again, I have little experience). So which should I use: the threshold for the training set or for the validation set? Thanks

Is this a wrong-headed question to ask in the first place, or does it reveal some general confusion about the concepts? Let me know if so.

Best Answer

I like to use the average of the best thresholds on the training and validation data sets.

This is Figure 4-2 from Practical Machine Learning with H2O (disclaimer: my book). [Figure omitted.]

It shows the best threshold on each of the train, valid, and test sets. The dashed line is the average of the train and valid thresholds. (Aside: the valid and test lines are noisier, as they are based on less data; the split was 13:1:1.)

The point is that any of these (training, valid, their average) will be in the right ballpark.

The reason I use the average is that the training set has 13x as much data but might have over-fitted, so I hope the validation data can balance it. I give them equal weight because this particular example is a time series, so the validation data is newer (closer to the test data) than the training data. (In this particular example, using the average is worse than using the training threshold, but better than the validation threshold.)
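In the h2o Python API, that recipe would look something like the sketch below (assuming a binomial model; I believe `find_threshold_by_max_metric` is the accessor that exposes the per-set best thresholds, but treat the exact call and metric name as assumptions):

```python
# Best threshold by max min_per_class_accuracy on each set.
t_train = model.find_threshold_by_max_metric("min_per_class_accuracy", train=True)
t_valid = model.find_threshold_by_max_metric("min_per_class_accuracy", valid=True)

# Equal-weight average of the two, as described above.
threshold = (t_train + t_valid) / 2.0
```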
