Machine Learning – How to Calibrate Models with Limited Data

calibration, classification, machine learning, random forest, validation

I am working on a random forest classification problem with a dataset of 977 records and 6 features. However, my classes are imbalanced, with a proportion of about 77:23.

I have been reading about calibration of models (for binary classification) to improve the predicted probabilities of an already fitted model (the random forest in this case).

However, I also found out that the calibration model has to be fit on a separate dataset.

The problem is that I have already used sklearn's train/test split: 680 records for training and 297 records for testing the random forest model.
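For context, my setup looks roughly like this (a sketch with a synthetic stand-in for my data; the forest settings, the stratified split, and the random seed are just illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as my data: 977 records, 6 features, ~77:23 classes
X, y = make_classification(n_samples=977, n_features=6,
                           weights=[0.77, 0.23], random_state=0)

# 680 training records, 297 test records
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=297, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)
proba_test = rf.predict_proba(X_test)[:, 1]  # predicted probability of the minority class
```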

So how can I calibrate my model, given that I don't have any new data?

In particular, since I am using a random forest, I would like to calibrate it to get better predicted probabilities.

If you are interested in my calibration curve and Brier score loss, please see below.

[calibration curve and Brier score loss for the random forest]
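The curve and score above come from something roughly like this (continuing the sketch from earlier, so it assumes y_test and proba_test from that block; the number of bins is an arbitrary choice):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# y_test and proba_test come from the random forest sketch above
print("Brier score loss:", brier_score_loss(y_test, proba_test))

frac_positives, mean_predicted = calibration_curve(y_test, proba_test, n_bins=10)
plt.plot(mean_predicted, frac_positives, marker="o", label="Random forest")
plt.plot([0, 1], [0, 1], linestyle="--", label="Perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```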

Update – extra trees classifier

[calibration curve for the extra trees classifier]

Update – logistic regression

[calibration curve for logistic regression]

Update – bootstrap optimism

[bootstrap optimism-corrected calibration curve]

Best Answer

I also found out that the calibration model has to be fit on a separate dataset.

That's not strictly true. As Frank Harrell explains, with data sets of this size it's generally best to develop the model on the entire data set and then validate the modeling process: refit the model on multiple bootstrap samples and evaluate each refit on the full data set. (Repeated cross-validation, as suggested by usεr11852, can also work for this.) That lets you estimate and correct for the optimism (bias) of the apparent performance, and produce calibration curves that are likely to reflect how the model will behave on new samples from the population. This presentation outlines the procedure in the context of logistic regression, but the principles are general.
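Here is a minimal sketch of that optimism bootstrap, applied to the Brier score with a random forest on synthetic data matching the question's dimensions (the number of bootstrap replicates, the forest settings, and the seeds are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss

# Synthetic stand-in with the question's dimensions: 977 records, 6 features, ~77:23 classes
X, y = make_classification(n_samples=977, n_features=6,
                           weights=[0.77, 0.23], random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# Apparent performance: fit on all the data, score on the same data (optimistic)
apparent = brier_score_loss(y, model.fit(X, y).predict_proba(X)[:, 1])

# Optimism bootstrap: refit on resamples, compare resample vs. full-data performance
rng = np.random.default_rng(0)
n_boot = 200                      # kept modest so the sketch runs quickly
optimism = []
for _ in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))          # sample rows with replacement
    m = clone(model).fit(X[idx], y[idx])                # refit the whole modeling process
    boot = brier_score_loss(y[idx], m.predict_proba(X[idx])[:, 1])  # score on the bootstrap sample
    orig = brier_score_loss(y, m.predict_proba(X)[:, 1])            # score on the original data
    optimism.append(orig - boot)                        # how much the resample flatters the model

corrected = apparent + np.mean(optimism)                # optimism-corrected Brier score
print(f"apparent Brier = {apparent:.3f}, optimism-corrected Brier = {corrected:.3f}")
```

The same loop, applied to a binned calibration curve instead of the Brier score, yields an optimism-corrected calibration curve; Harrell's rms package in R (its validate and calibrate functions) automates both, and repeated cross-validation can be set up analogously using out-of-fold predictions.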
