Scikit-Learn Classifier Calibration – Correct Way with CalibratedClassifierCV

Tags: calibration, cross-validation, scikit-learn, train, validation

Scikit-learn has CalibratedClassifierCV, which allows us to calibrate our models on a particular X, y pair. Its documentation also states clearly that the data for fitting the classifier and the data for calibrating it must be disjoint.

If they must be disjoint, is it legitimate to train the classifier with the following?

model = CalibratedClassifierCV(my_classifier)
model.fit(X_train, y_train)

I fear that by using the same training set I'm breaking the disjoint-data rule. An alternative might be to have a validation set:

my_classifier.fit(X_train, y_train)
model = CalibratedClassifierCV(my_classifier, cv='prefit')
model.fit(X_valid, y_valid)

This has the disadvantage of leaving less data for training. Also, if CalibratedClassifierCV should only be fit on models trained on a different data set, why is its default cv=3, which will also fit the base estimator? Does the cross-validation handle the disjoint rule on its own?

Question: what is the correct way to use CalibratedClassifierCV?

Best Answer

There are two things mentioned in the CalibratedClassifierCV docs that hint towards the ways it can be used:

base_estimator: If cv=prefit, the classifier must have been fit already on data.

cv: If “prefit” is passed, it is assumed that base_estimator has been fitted already and all data is used for calibration.

I may, of course, be interpreting this wrong, but it appears you can use CCCV (short for CalibratedClassifierCV) in two ways:

Number one:

  • You train your model as usual, your_model.fit(X_train, y_train).
  • Then, you create your CCCV instance, your_cccv = CalibratedClassifierCV(your_model, cv='prefit'). Notice you set cv='prefit' to flag that your model has already been fit.
  • Finally, you call your_cccv.fit(X_validation, y_validation). This validation data is used solely for calibration purposes (see the sketch below).
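
Putting method one together, here is a minimal runnable sketch. The RandomForestClassifier base model, the synthetic dataset, and the 80/20 split are my own illustrative choices, not part of the original question:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data and split, purely for illustration
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

base = RandomForestClassifier(random_state=0)
base.fit(X_train, y_train)  # step 1: train on the training set only

cccv = CalibratedClassifierCV(base, cv='prefit')
cccv.fit(X_valid, y_valid)  # step 2: calibrate on the held-out validation set

calibrated_probs = cccv.predict_proba(X_valid)

Because the calibration step only ever sees X_valid, the disjoint rule is satisfied by construction.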

Number two:

  • You have a new, untrained model.
  • Then you create your_cccv = CalibratedClassifierCV(your_untrained_model, cv=3). Notice cv is now the number of folds.
  • Finally, you call your_cccv.fit(X, y). Because your model is untrained, X and y have to be used for both training and calibration. The way to ensure the data is 'disjoint' is cross-validation: for any given fold, CCCV will split X and y into your training and calibration data, so they do not overlap (see the sketch below).
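
A matching sketch of method two (the LogisticRegression base model and the toy dataset are, again, just illustrative assumptions):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

cccv = CalibratedClassifierCV(LogisticRegression(), cv=3)
# For each of the 3 folds, a clone of the base model is trained on the
# other two folds and calibrated on the held-out fold, so training and
# calibration data never overlap within any single fold.
cccv.fit(X, y)

calibrated_probs = cccv.predict_proba(X)

The fitted CCCV keeps all three calibrated clones and averages their predicted probabilities at prediction time.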

TL;DR: Method one lets you control exactly what is used for training and for calibration. Method two uses cross-validation to make the most of your data for both purposes.
