Scikit-Learn Classifier Calibration – Correct Way with CalibratedClassifierCV

calibrationcross-validationscikit learntrainvalidation

Scikit has CalibratedClassifierCV, which allows us to calibrate our models on a particular X, y pair. It also states clearly that data for fitting the classifier and for calibrating it must be disjoint.

If they must be disjoint, is it legitimate to train the classifier with the following?

model = CalibratedClassifierCV(my_classifier)
model.fit(X_train, y_train)

I fear that by using the same training set I'm breaking the disjoint data rule. An alternative might be to have a validation set

my_classifier.fit(X_train, y_train)
model = CalibratedClassifierCV(my_classifier, cv='prefit')
model.fit(X_valid, y_valid)

Which has the disadvantage of leaving less data for training. Also, if CalibratedClassifierCV should only be fit on models fit on a different training set, why would it's default options be cv=3, which will also fit the base estimator? Does the cross validation handle the disjoint rule on its own?

Question: what is the correct way to use CalibratedClassifierCV?

Best Answer

There are two things mentioned in the CalibratedClassifierCV docs that hint towards the ways it can be used:

base_estimator: If cv=prefit, the classifier must have been fit already on data.

cv: If “prefit” is passed, it is assumed that base_estimator has been fitted already and all data is used for calibration.

I may obviously be interpreting this wrong, but it appears you can use the CCCV (short for CalibratedClassifierCV) in two ways:

Number one:

You train your model as usual, your_model.fit(X_train, y_train).
Then, you create your CCCV instance, your_cccv = CalibratedClassifierCV(your_model, cv='prefit'). Notice you set cv to flag that your model has already been fit.
Finally, you call your_cccv.fit(X_validation, y_validation). This validation data is used solely for calibration purposes.

Number two:

You have a new, untrained model.
Then you create your_cccv=CalibratedClassifierCV(your_untrained_model, cv=3). Notice cv is now the number of folds.
Finally, you call your_cccv.fit(X, y). Because your model is untrained, X and y have to be used for both training and calibration. The way to ensure the data is 'disjoint' is cross validation: for any given fold, CCCV will split X and y into your training and calibration data, so they do not overlap.

TLDR: Method one allows you to control what is used for training and for calibration. Method two uses cross validation to try and make the most out of your data for both purposes.

Related Solutions

Solved – How to use scikit-learn’s cross validation functions on multi-label classifiers

Stratified sampling means that the class membership distribution is preserved in your KFold sampling. This doesn't make a lot of sense in the multilabel case where your target vector might have more than one label per observation.

There are two possible interpretations of stratified in this sense.

For $n$ labels where at least one of them is filled that gives you $\sum\limits_{i=1}^n2^n$ unique labels. You could perform stratified sampling on the each of the unique label bins.

The other option is to try and segment the training data s.t. that probability mass of the distribution of the label vectors is approximately the same over the folds. E.g.

import numpy as np

np.random.seed(1)
y = np.random.randint(0, 2, (5000, 5))
y = y[np.where(y.sum(axis=1) != 0)[0]]


def proba_mass_split(y, folds=7):
    obs, classes = y.shape
    dist = y.sum(axis=0).astype('float')
    dist /= dist.sum()
    index_list = []
    fold_dist = np.zeros((folds, classes), dtype='float')
    for _ in xrange(folds):
        index_list.append([])
    for i in xrange(obs):
        if i < folds:
            target_fold = i
        else:
            normed_folds = fold_dist.T / fold_dist.sum(axis=1)
            how_off = normed_folds.T - dist
            target_fold = np.argmin(np.dot((y[i] - .5).reshape(1, -1), how_off.T))
        fold_dist[target_fold] += y[i]
        index_list[target_fold].append(i)
    print("Fold distributions are")
    print(fold_dist)
    return index_list

if __name__ == '__main__':
    proba_mass_split(y)

To get the normal training, testing indices that KFold produces you want to rewrite that to it returns the np.setdiff1d of each index with np.arange(y.shape[0]), then wrap that in a class with an iter method.

Solved – How to use k-fold cross validation in naive bayes classifier

You are very close to understanding k-fold cross-validation. To answer your questions in turn.

1. So to use k-fold cross validation the required data is the labeled data?

Yes, you must have some 'known' result in order for your model to be trained on the data. You are building a model, I assume, to predict some sort of outcome either regression or classification. In order to do so, a model must be built on data to explain some known result.

2. How about non labeled data?

For k-fold cross-validation, you will have split your data into k groups (e.g. 10). You then select one of those groups and use the model (built from your training data) to predict the 'labels' of this testing group. Once you have your model built and cross-validated, then it can be used to predict data that don't currently have labels. The cross-validation is a means to prevent overfitting.

As a last clarification, you aren't only using 1 of the 10 groups. Let's say you had 100 samples. You split it into groups 1-10, 11-20, ... 91-100. You would first train on all the groups from 11-100 and predict the test group 1-10. Then you would repeat the same analysis on 1-10 and 21-100 as the training and 11-20 as the testing group and so forth. The results typically averaged at the end.

As a simple example say I have the following abbreviated data (binary classification):

Label    Variable
A        0.354
A        0.487
A        0.384
A        0.395
A        0.436
B        0.365
B        0.318
B        0.327
B        0.381
B        0.355

Let's say I want to do 10-fold cross-validation on this (nearly Leave-One-Out cross-validation in this case)

My first testing group will be:

A        0.354
A        0.487

My training set is the remaining data. See how the labels are present in both groups?

A        0.384
A        0.395
A        0.436
B        0.365
B        0.318
B        0.327
B        0.381
B        0.355

Please note that it is also best practice to randomize the grouping, this is purely for demonstration

Then you fit your model to the training set, which is using the variable(s) to best explain the labels (class A or B). The model that has been fit to this training set is then used to predict the testing dataset. You remove the labels from the testing set and predict them using the trained model. You then compare the predicted labels to the actual labels. This is repeated for all 10-folds and the results averaged.

Once everything is completed and you have your wonderfully cross-validated model, you can use it to predict unlabeled data and have some sort measure of confidence in your results.

Extended for Parameter Tuning

Let's say you are tuning a partial least squares (PLS) model (it doesn't matter if you don't know what this is for demonstration purposes). I would like determine how many components (the tuning parameter) I should have in the model. I would like to test 2,3,4, and 5 components and see how many I should use to maximize my predictive accuracy without overfitting the model. I would conduct the entire cross-validation series for each component number. Each iteration of the CV would be averaged (the average predictive accuracy of the entire analysis).

Assuming classification accuracy is your metric let's say these are my results (completely made up here):

2 components: 70%
3 components: 82%
4 components: 78%
5 components: 74%

Clearly, I would then choose 3 components for my model which has now been cross-validated to avoid overfitting and maximizing predictive accuracy. I can then use this optimized model to predict a new dataset where I don't know the labels.

Best Answer

Related Solutions

Solved – How to use scikit-learn’s cross validation functions on multi-label classifiers

Solved – How to use k-fold cross validation in naive bayes classifier

Related Question