Solved – Binary classification: single-label, probability-based metric/calibration

Tags: binary-data, calibration, classification, metric, prediction

Situation

I have a dataset (15–20k samples) with two classes. I can train a classifier on both classes, but am only allowed to test/predict on one class. The dataset is imbalanced (~1:4).

Goal

I want to find out how much the classifier was able to learn from the dataset. I am therefore interested in the predicted probabilities for the one class I can test on, or rather their "distribution".

Problem

Metrics such as the TPR exist, but they use only the predicted labels, not the predicted "probabilities". With an imbalanced dataset and an uncalibrated classifier, this does not seem optimal.

Question

Is there a good metric that takes the predicted "probabilities" (without calibration, we may not even be able to speak of probabilities…) of only one class, plus the true labels, and returns a meaningful score? Or is it possible to calibrate the output of a classifier using only one class to test on, so that the predictions become more meaningful?
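To make the asked-for interface concrete, here is a hypothetical illustration (not something proposed in the question itself): one score computable from the tested class alone is the average negative log-likelihood of that class, a per-class log loss. Whether such a number is meaningful without calibration is exactly what is in doubt. The function name `one_class_log_loss` is made up for this sketch.

```python
import numpy as np

def one_class_log_loss(p_true_class, eps=1e-15):
    """Average negative log-likelihood of the tested class, given the
    classifier's predicted probability for that class on examples whose
    true label IS that class. Lower is better."""
    p = np.clip(np.asarray(p_true_class, dtype=float), eps, 1 - eps)
    return -np.mean(np.log(p))

# e.g., probabilities the classifier assigned to the tested class:
print(one_class_log_loss([0.9, 0.7, 0.95, 0.6]))
```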

Best Answer

I recommend you look into cost curves. These (shown on the right of the figure below) plot the normalized expected cost (i.e., the error rate, normalized to lie in [0, 1]) as a function of the probability cost, a quantity that combines the class prior with the misclassification costs. This will not necessarily give you a single score, but it will show the range of performance.

[Figure from Drummond & Holte (2006): cost curves are shown in the right panel.]

Drummond, C., & Holte, R. C. (2006). Cost curves: An improved method for visualizing classifier performance. Machine Learning, 65(1), 95–130.
