Solved – Proper way to incorporate CalibratedClassifierCV in cross-validation in scikit-learn

calibration, classification, cross-validation, tuning

I'm creating some classifiers for a binary classification problem. I want to find out three things:

  1. Which algorithm I should use.
  2. Which set of hyperparameters I should use.
  3. Whether or not I should calibrate the probability output of the classifier.

I was wondering how best to do this. Basically I'm doing nested cross-validation (outer loop for the algorithm, inner loop for the hyperparameters) and combining it with probability calibration (and I know I shouldn't use the same data to train the model and calibrate probabilities). Here's the code I've come up with (it uses a toy dataset):

# imports
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

# loading data
cancer = datasets.load_breast_cancer()

X = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
y = pd.DataFrame(cancer['target'], columns=['target'])
df = pd.DataFrame(pd.concat([X, y], axis=1))

# creating holdout data for final model evaluation
X, X_hold, y, y_hold = train_test_split(X,y,train_size=0.8, random_state=35)

# defining everything needed for cross-validation
kfold = KFold(3, random_state=1234, shuffle=True)
rf = RandomForestClassifier()
cart = DecisionTreeClassifier()

rf_parameters = {'n_estimators': [10, 40, 100], 'max_depth': [1, 5, 10]}
cart_parameters = { 'max_depth': [1, 5, 10]}

models = {cart:cart_parameters, rf:rf_parameters}

scoring = {'AUC': 'roc_auc', 'Brier_loss': 'neg_brier_score'}

brier_scores = []
resulting_models = []

for m,p in models.items():
    
    for train_index, test_index in kfold.split(X):
        
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        
        X_train_model, X_train_calibration, y_train_model, y_train_calibration \
                 = train_test_split(X_train, y_train, test_size=0.4, random_state=1234)
        
        
        clf = GridSearchCV(estimator=m, param_grid=p, scoring=scoring,
                            refit='Brier_loss', n_jobs=8, cv=3, verbose=1)
        
        clf.fit(X_train_model, y_train_model)
        
        best_model = clf.best_estimator_
        predictions = best_model.predict_proba(X_test)[:,1]
        
        calibrated = CalibratedClassifierCV(best_model, cv="prefit")
        calibrated.fit(X_train_calibration, y_train_calibration)
        
        predictions_calibrated = calibrated.predict_proba(X_test)[:,1]
        
        score = brier_score_loss(y_test, predictions)
        calibrated_score = brier_score_loss(y_test, predictions_calibrated)
                
        if score <= calibrated_score:
            resulting_models.append(best_model)
            brier_scores.append(score)
        else:
            resulting_models.append(calibrated)
            brier_scores.append(calibrated_score)

            
# printing results for decision
# note: the outer loop runs all CART folds first, then all RF folds,
# so the scores are grouped by model rather than interleaved
final_scores = list(zip(resulting_models, brier_scores))
final_scores_cart = [score for _, score in final_scores[:3]]
final_scores_rf = [score for _, score in final_scores[3:]]

print('CART:', sum(final_scores_cart) / 3, 'RF:', sum(final_scores_rf) / 3)

At the end of this bit of code, I will be able to decide which algorithm to use (CART vs RF). I will then remove the first loop so I can decide which set of hyperparameters to use. After that, I will remove the GridSearch part to decide only whether I should calibrate my probabilities or not.

After all this, I can evaluate the "true" error of my model using X_hold and y_hold. Then I'll retrain the model and calibration (if necessary) using the full dataset.
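Continuing from the script above, roughly what I have in mind for that final refit is something like the sketch below (assuming, hypothetically, that RF with the grid above wins and that calibration helps; the cv=5 inside CalibratedClassifierCV and method='sigmoid' are arbitrary choices on my part):

# hypothetical final refit on all the data, assuming RF + calibration won
final_search = GridSearchCV(estimator=RandomForestClassifier(),
                            param_grid=rf_parameters,
                            scoring='neg_brier_score', cv=3, n_jobs=8)
# cv=5 means CalibratedClassifierCV refits the (tuned) model on 4/5 of the
# data and calibrates on the remaining 1/5, so calibration never sees the
# model's own training data
final_model = CalibratedClassifierCV(final_search, method='sigmoid', cv=5)
final_model.fit(pd.concat([X, X_hold]), pd.concat([y, y_hold]).values.ravel())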

Does this make sense to you? Any suggestions on how to do this properly? I feel like I'm missing something.

Best Answer

I have this exact problem. In fact, I had another layer of complexity because I also wanted to select the best post-processing (for example, scaling, PCA, SelectKBest...) and wondered whether I needed another level of CV for it. I searched a lot for answers on the internet and did not find any, so I'll try to explain what I have done. First of all, my thoughts:

  • It makes sense to have a nested CV scheme in order to select model and hyperparameters on the inner loop and then assess the quality of the model (in fact, of the selection procedure for the model + the model itself) on the outer loop.
  • It makes sense to repeat the procedure on the full dataset after you have assessed the error that you make.
  • The problem that calibration presents is that it needs data the model was not trained on, as you have pointed out, so that is another issue to solve (a minimal sketch follows right after this list).
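
Just to make those three points concrete before the full code below (this is not my pipeline, only the bare mechanics on the question's toy dataset; the method and fold counts are arbitrary choices here):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_toy, y_toy = load_breast_cancer(return_X_y=True)

# cv=3: the base model is refit on 2/3 of each training split and calibrated
# on the remaining 1/3, so calibration never sees the model's training data
calibrated_rf = CalibratedClassifierCV(RandomForestClassifier(), method='sigmoid', cv=3)

# the outer CV assesses the whole (fit + calibrate) procedure on held-out folds
outer_scores = cross_val_score(calibrated_rf, X_toy, y_toy, cv=5, scoring='neg_brier_score')
print(outer_scores.mean())

# after assessing the error, repeat the procedure on the full dataset
calibrated_rf.fit(X_toy, y_toy)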

So I have thought about how to make sense of all of it, that is, how to select the best calibrated model + best parameters (including the best post-process parameters or procedure), and how to assess how good the model is on completely unseen data. And I wanted to do it with a two-level nested cross-validation, because three layers is too much complexity.

In my view, the inner cross-validation should train the model as if it were the definitive one, so the calibration should enter the inner layer. The difficulty that then arises is that, in principle, you don't have new data to see how good the model is in order to select the best parameters (by the way, as far as I know, Brier loss is not a very good metric). To deal with it, I have tried two things:

  • Define a new metric for optimization, which I have called histogram-width, and which measures how wide the histogram of predicted probabilities is. It makes sense to me because if the model is calibrated, the more confident the predictions are, the wider the histogram will be (most of the predictions will be near 0 or 1 if the model has good predictive power). The advantage of this metric is that it does not need new data. Empirically, I have found that it tracks ROC AUC on a hold-out set quite closely (there is a small numerical illustration right after this list).
  • Try common metrics, but care must be taken when evaluating them. Since CalibratedClassifierCV is an ensemble of base estimators and calibrators, you can use the inner base estimators to check the metric on the hold-out set of the same KFold. I had to make some assumptions and I think my code is correct, but it's kind of weird.
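
To make the histogram-width idea concrete, here is a tiny numerical illustration (the y_true argument is unused; it only keeps the signature consistent with the sklearn losses):

import numpy as np

def histogram_width(y_true, y_proba):
    # 4 * mean squared distance from 0.5: 0 if every prediction is 0.5,
    # 1 if every prediction is exactly 0 or 1
    return 4 * (np.sum((y_proba - 0.5) ** 2) / len(y_proba))

print(histogram_width(None, np.array([0.5, 0.5, 0.5, 0.5])))    # 0.0   -> totally unconfident
print(histogram_width(None, np.array([0.1, 0.9, 0.05, 0.95])))  # 0.725 -> fairly confident
print(histogram_width(None, np.array([0.0, 1.0, 1.0, 0.0])))    # 1.0   -> maximally confident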

Since code is more explanatory than words, I'll post my own code for the function that tries to do what I explained above:

  1. Select the best model + hyperparameters + post-process pipeline parameters with an inner cross-validation.
  2. Validate this model-selection procedure with an outer cross-validation and generate a report in a python-docx Document.
  3. Repeat the procedure with the full dataset in order to build the final model.

I always try to write nice code and end up with something very messy... I'm sorry. You can ask about anything that is not clear.

import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args
from skopt import gp_minimize
from xgboost import XGBClassifier
from imblearn.pipeline import Pipeline
from skopt.space.transformers import Identity
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import brier_score_loss, average_precision_score, log_loss
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
import matplotlib.pyplot as plt
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, NearMiss, TomekLinks, NeighbourhoodCleaningRule
from imblearn.combine import SMOTEENN, SMOTETomek
import functools
import lightgbm as lgb
from docx import Document
from docx.shared import Inches
from io import BytesIO
from sklearn.metrics import confusion_matrix
import seaborn as sns
import numpy as np


def plot_calibration_curve(y, y_proba):
    fig, ax = plt.subplots()
    # calibration_curve returns (prob_true, prob_pred); plot predicted on x, observed on y
    prob_true_1, prob_pred_1 = calibration_curve(y, y_proba, n_bins=12)
    ax.plot([0, 1], [0, 1], linestyle='--', label='Perfect calibration')
    ax.plot(prob_pred_1, prob_true_1, marker='.')
    ax.set(xlabel='Average predicted probability in each bin', ylabel='Ratio of positives')
    memfile = BytesIO()
    plt.savefig(memfile)
    return memfile

def plot_precision_recall_curve(y, y_proba):
    fig, ax = plt.subplots()
    precision, recall, _ = precision_recall_curve(y, y_proba)
    average_precision = average_precision_score(y, y_proba)
    ax.plot(precision, recall, label=f'AP = {average_precision:0.2f}')
    ax.set(xlabel='Precision', ylabel='Recall')
    ax.legend(loc="lower left")
    memfile = BytesIO()
    plt.savefig(memfile)
    return memfile

def plot_roc_curve(y, y_proba):
    fig, ax = plt.subplots()
    fpr, tpr, _ = roc_curve(y, y_proba)
    roc_auc = roc_auc_score(y, y_proba)
    ax.plot(fpr, tpr, label=f'ROC = {roc_auc:0.2f}')
    ax.set(xlabel='False Positive Rate', ylabel='True Positive Rate')
    ax.legend(loc="lower right")
    memfile = BytesIO()
    plt.savefig(memfile)
    return memfile

def plot_confusion_matrix(y, y_proba):
    y_pred = y_proba > 0.5
    fig, ax = plt.subplots()
    confm = confusion_matrix(y, y_pred)
    # Normalize
    confm = confm.astype('float') / confm.sum(axis=1)[:, np.newaxis]
    ax = sns.heatmap(confm, cmap='Oranges', annot=True)
    ax.set(xlabel='Predicted label', ylabel='True label')
    memfile = BytesIO()
    plt.savefig(memfile)
    return memfile

def neg_score(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        score = func(*args, **kwargs)
        return 1.0 - score
    return wrapper

# Wrap skopt's Identity so it exposes a fit(X, y) method and can be used
# as a no-op step in an sklearn/imblearn Pipeline.
class MyIdentity(Identity):
    def fit(self, X, y=None):
        return self

# This does not work if any of the pipelines in dict_pipelines
# contains a resampler; in that case, the resampler has to go
# in a separate Pipeline step.
class OptionedPostProcessTransformer(TransformerMixin):

    def __init__(self, dict_pipelines):
        self.dict_pipelines = dict_pipelines
        self.option = list(dict_pipelines.keys())[0]
        super().__init__()

    def fit(self, X, y=None):
        self.dict_pipelines[self.option].fit(X, y)
        return self

    def set_params(self, **params):
        self.option = params['option']
        return self

    def transform(self, X):
        return self.dict_pipelines[self.option].transform(X)

    def fit_transform(self, X, y=None):
        return self.dict_pipelines[self.option].fit_transform(X, y)
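
# Example of how this transformer is driven from the skopt search space: the
# Categorical 'post_process__option' dimension ends up in set_params, which
# switches the active sub-pipeline, e.g.
#   pipe = Pipeline([('post_process',
#                     OptionedPostProcessTransformer(dict_pipelines_post_process)),
#                    ('model', XGBClassifier())])
#   pipe.set_params(post_process__option='option_2')  # scaler + SelectKBest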

def histogram_width(y_true, y_proba):
    return 4 * (np.sum((y_proba - 0.5) ** 2) / len(y_proba))

dict_resamplings = {
    'random_over': RandomOverSampler(),
    'smote': SMOTE(),
    'adasyn': ADASYN(),
    'random_under': RandomUnderSampler(),
    'nearmiss': NearMiss(version=3, n_neighbors_ver3=3),
    'tomeklinks': TomekLinks(),
    'ncr': NeighbourhoodCleaningRule(),
    'smotetomek': SMOTETomek(),
    'smoteenn': SMOTEENN(sampling_strategy='minority')
}

dict_metrics_loss = {
    'roc_auc': neg_score(roc_auc_score),
    'log_loss': log_loss,
    'pr_auc': neg_score(average_precision_score),
    'brier_loss': brier_score_loss,
    'histogram_width': neg_score(histogram_width)
}
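
# Every entry above is oriented as a loss: roc_auc, pr_auc and histogram_width
# are wrapped with neg_score (1 - score) so that lower is always better, which
# is what gp_minimize expects from the objective.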

dict_pipelines_post_process = {
        'option_1': Pipeline([
            ('scale', StandardScaler()),
            ('reduce_dims', PCA(n_components=50))
        ]),
        'option_2': Pipeline([
            ('scale', StandardScaler()),
            ('reduce_dims', SelectKBest(mutual_info_classif, k=100))
        ]),
        'option_3': Pipeline([
            ('identity', MyIdentity())
            ])
}

dict_models_example = {
    'gradient_boosting': {
        'model': GradientBoostingClassifier(),
        'pipeline_post_process': Pipeline([("identity", MyIdentity())]),
        'search_space': [
            Integer(4, 12, name='model__max_depth'),
            Integer(10, 500, name='model__n_estimators'),
            Real(0.001, 0.15, prior='log-uniform', name='model__learning_rate'),
            Real(0.005, 0.10, prior='log-uniform', name='model__min_samples_split'),
            Real(0.005, 0.10, prior='log-uniform', name='model__min_samples_leaf'),
            Real(0.8, 1, prior='log-uniform', name='model__subsample')
        ]
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'pipeline_post_process': Pipeline([
            ('scale', StandardScaler()),
            ('reduce_dims', PCA(n_components=50))
        ]),
        'search_space': [
            Integer(30, 100, name='reduce_dims__n_components'),
            Integer(0, 1, name='model__bootstrap'),
            Integer(10, 1000, name='model__n_estimators'),
            Integer(5, 15, name='model__max_depth'),
            Integer(5, 50, name='model__min_samples_split'),
            Integer(1,4, name='model__min_samples_leaf'),
            Categorical(['auto', 'sqrt'], name='model__max_features'),
            Categorical(['balanced', 'balanced_subsample'], name='model__class_weight')
        ]
    },
    'xgboost': {
        'model': XGBClassifier(),
        'pipeline_post_process': Pipeline([
            ('post_process', OptionedPostProcessTransformer(dict_pipelines_post_process))
        ]),
        'search_space': [
            Categorical(['option_1', 'option_2', 'option_3'], name='post_process__option'),
            Integer(5, 15, name='model__max_depth'),
            Real(0.05, 0.31, prior='log-uniform', name='model__learning_rate'),
            Integer(1, 10, name='model__min_child_weight'),
            Real(0.8, 1, prior='log-uniform', name='model__subsample'),
            Real(0.13, 0.8, prior='log-uniform', name='model__colsample_bytree'),
            Real(0.1, 10,prior='log-uniform', name='model__scale_pos_weight'),
            Categorical(['binary:logistic'], name='model__objective')
        ]
    },
    'lightgbm': {
        'model': lgb.LGBMClassifier(),
        'pipeline_post_process': Pipeline([("identity", MyIdentity())]),
        'search_space': [
            Real(0.01, 0.5, prior='log-uniform', name='model__learning_rate'),
            Integer(1, 30, name='model__max_depth'),
            Integer(10, 400, name='model__num_leaves'),
            Real(0.1, 1.0, prior='uniform', name='model__feature_fraction'),
            Real(0.1, 1.0, prior='uniform', name='model__subsample'),
            Categorical(['balanced'], name='model__class_weight'),
            Categorical(['binary'], name='model__objective')
        ]
    }
}

def find_best_calibrated_binary_model(
        X,
        y,
        k_outer_fold=5,
        k_inner_fold=2,
        report_doc=None,
        n_initial_points=5,
        n_calls=10,
        dict_model_params=dict_models_example,
        metric='histogram_width',
        peeking_metrics=None,
        verbose=False,
        skopt_func=gp_minimize
    ):
    """Finds best binary calibrated classification model and optionally
    generate a report doing a nested cross validation. In the inner
    cross validation, doing a Bayesian Search, the best parameters are found.
    In the outer cross validation, the model is validated.
    Finally, the whole procedure is used for the full dataset to return
    the best possible model.


    Parameters
    ----------
    X : np.array
        Feature set.

    y : np.array
        Classification target to predict.

    k_outer_fold : int, default=5
        Number of folds for the outer cross-validation.

    k_inner_fold : int, default=2
        Number of folds for the inner cross-validation.

    report_doc : Document or None
        Document used to write report of training.

    n_initial_points : int, default=5
        Number of initial points to use in Bayesian Optimization.

    n_calls : int, default=10
        Total number of evaluations of the objective in the Bayesian
        Optimization.

    dict_model_params : dict
        Dict of models to try inside the inner loops. For each model, there is
        the corresponding list of skopt space objects that delimit where the
        parameters live, including the post-process pipeline to apply.

    metric : str, default='histogram_width'
        Metric to use in order to find the best parameters in the Bayesian
        Search. Options:
        - roc_auc
        - pr_auc
        - log_loss
        - brier_loss
        - histogram_width

    peeking_metrics : List[str], default=None
        If not None, in the report there will be a comparison between the metric of
        evaluation on the inner fold and the list of metrics in peeking_metrics on the
        outer fold. This can be used to assess the quality of the metric used, but could
        lead to underestimate the error if not taken proper care.

    verbose : bool, default=False
        If True, you can trace the progress in the terminal.

    skopt_func : callable, default=gp_minimize
        Minimization function of the skopt library to be used.

    Returns
    -------
    model : Model trained on the full dataset using the same procedure
        as in the inner cross-validation. If report_doc is not None, the
        report is written into that Document in place.
    """
    outer_cv = StratifiedKFold(n_splits=k_outer_fold)
    k=0
    for train_index, test_index in outer_cv.split(X, y):
        report_doc.add_heading(f'Report of training in fold {k} of outer Cross Validation', level=1)
        inner_model = train_inner_calibrated_binary_model(
            X=X[train_index], y=y[train_index], k_inner_fold=k_inner_fold,
            X_hold_out=X[test_index], y_hold_out=y[test_index],
            report_doc=report_doc, n_initial_points=n_initial_points,
            n_calls=n_calls, dict_model_params=dict_model_params,
            metric=metric, peeking_metrics=peeking_metrics, verbose=verbose,
            skopt_func=skopt_func)
        report_doc.add_heading(f'Report of validation in fold {k} of outer Cross Validation', level=1)
        evaluate_model(
            model=inner_model, X=X[test_index], y=y[test_index],
            report_doc=report_doc, peeking_metrics=peeking_metrics
        )
        k += 1
    # After assessing the procedure, we repeat it on the full dataset:
    return train_inner_calibrated_binary_model(
            X=X, y=y, k_inner_fold=k_inner_fold,
            report_doc=None, n_initial_points=n_initial_points,
            n_calls=n_calls, dict_model_params=dict_model_params,
            metric=metric, verbose=verbose, skopt_func=skopt_func)

def evaluate_model(model, X, y, peeking_metrics=None, report_doc=None):
    y_proba = model.predict_proba(X)[:, 1]
    if peeking_metrics:
        report_doc.add_heading('Main metrics', level=2)
        for metric in peeking_metrics:
            report_doc.add_paragraph(f"Metric {metric} is {dict_metrics_loss[metric](y, y_proba)}\n")
    report_doc.add_heading('Main plots', level=2)

    # Plot calibration curve
    report_doc.add_paragraph('Calibration plot')
    memfile = plot_calibration_curve(y, y_proba)
    report_doc.add_picture(memfile, width=Inches(5))
    memfile.close()

    # Plot precision-recall curve
    report_doc.add_paragraph('Precision-recall curve plot')
    memfile = plot_precision_recall_curve(y, y_proba)
    report_doc.add_picture(memfile, width=Inches(5))
    memfile.close()

    # Plot ROC curve
    memfile = plot_roc_curve(y, y_proba)
    report_doc.add_paragraph('ROC curve plot')
    report_doc.add_picture(memfile, width=Inches(5))
    memfile.close()

    # Plot confusion matrix
    memfile = plot_confusion_matrix(y, y_proba)
    report_doc.add_paragraph('Confusion matrix')
    report_doc.add_picture(memfile, width=Inches(5))
    memfile.close()

    return


def evaluate_metric_cv(score_func, calibrated_model, X, y, k_inner_fold, greater_is_better=False):
    scores = []
    inner_cv = StratifiedKFold(n_splits=k_inner_fold)
    for _, test_index in inner_cv.split(X, y):
        candidate_scores = []
        for classifier in calibrated_model.calibrated_classifiers_:
            X_hold_out = X[test_index]
            y_hold_out = y[test_index]
            y_proba = classifier.base_estimator.predict_proba(X_hold_out)[:, 1]
            candidate_scores.append(score_func(y_hold_out, y_proba))
        # As the mapping between inner folds and trained base classifiers is
        # not guaranteed, we assume that the worst score on this fold comes
        # from the classifier that was not trained on it
        if greater_is_better:
            scores.append(min(candidate_scores))
        else:
            scores.append(max(candidate_scores))
    return sum(scores) / len(scores)

def train_inner_calibrated_binary_model(X, y, X_hold_out=None, y_hold_out=None,
                                        k_inner_fold=2, report_doc=None,
                                        n_initial_points=5, n_calls=10,
                                        dict_model_params=dict_models_example, metric='histogram_width',
                                        peeking_metrics=None, verbose=False, skopt_func=gp_minimize):
    list_models = []
    list_metrics = []
    list_comparisons = []
    score_loss_func = dict_metrics_loss[metric]

    for key in dict_model_params.keys():

        pipeline_post_process = dict_model_params[key]['pipeline_post_process']
        model = dict_model_params[key]['model']
        search_space = dict_model_params[key]['search_space']

        complete_steps = pipeline_post_process.steps + [('model', model)]
        complete_pipeline = Pipeline(complete_steps)

        @use_named_args(search_space)
        def func_to_minimize(**params):
            complete_pipeline.set_params(**params)
            if verbose:
                print(f"Optimizing model {key}\n")
                print(f"With parameters {params}\n")

            # Fit a calibrated classifier with k-fold internal splitting.
            # predict_proba on the training data is only valid for the
            # histogram_width metric; any other metric is evaluated fold-wise
            # with evaluate_metric_cv below.

            calibrated_model = CalibratedClassifierCV(complete_pipeline, method='isotonic', cv=k_inner_fold)
            calibrated_model.fit(X, y)
            y_proba = calibrated_model.predict_proba(X)[:, 1]
            if metric != 'histogram_width':
                loss_score = evaluate_metric_cv(score_loss_func, calibrated_model, X, y, k_inner_fold)
            else:
                loss_score = score_loss_func(y, y_proba)

            list_models.append(calibrated_model)

            list_metrics.append(loss_score)
            if verbose:
                print(f"Metric is {loss_score}\n")

            if report_doc:
                dict_comparison = {}
                dict_comparison['model'] = key
                dict_comparison['params'] = params
                for peeking_metric in [metric] + peeking_metrics:
                    if peeking_metric == 'histogram_width':
                        inner_metric = dict_metrics_loss[peeking_metric](y, y_proba)
                    else:
                        inner_metric = evaluate_metric_cv(dict_metrics_loss[peeking_metric], calibrated_model, X, y, k_inner_fold)
                    y_hold_out_proba = calibrated_model.predict_proba(X_hold_out)[:, 1]
                    outer_metric = dict_metrics_loss[peeking_metric](y_hold_out, y_hold_out_proba)
                    dict_comparison['inner_' + peeking_metric] = inner_metric
                    dict_comparison['outer_' + peeking_metric] = outer_metric
                list_comparisons.append(dict_comparison)
            return loss_score

        # perform optimization
        skopt_func(func_to_minimize, search_space, n_initial_points=n_initial_points, n_calls=n_calls)
    index_best_model = list_metrics.index(min(list_metrics))
    best_model = list_models[index_best_model]
    if verbose:
        print("Best model found")
    if report_doc:
        comparisons_df = pd.DataFrame(list_comparisons)
        report_doc.add_heading(f'Comparison of best model with different metrics', level=2)
        report_doc.add_paragraph(f'Best model with respect to selected metric {metric} is {comparisons_df.loc[comparisons_df["inner_" + metric].idxmin()]}\n')
        for peeking_metric in peeking_metrics:
            report_doc.add_paragraph(f'Best model with respect to {peeking_metric} is {comparisons_df.loc[comparisons_df["inner_" + peeking_metric].idxmin()]}\n')
    return best_model

if __name__ == "__main__":
    dataset = pd.read_csv('D:/Python/nudge/data/04_feature/dataset.csv', **{'sep': ";", 'index_col': False, 'decimal': ','})
    index_cols = ['NIF Identificado', 'Ejercicio']
    target_col = ['target']
    dataset = dataset.sample(n=1000)
    y = dataset['target'].to_numpy()
    X = dataset[[c for c in dataset.columns if c not in index_cols + target_col]].to_numpy()
    dict_models = {
        'xgboost': {
            'model': XGBClassifier(),
            'pipeline_post_process': Pipeline([
                ('post_process', OptionedPostProcessTransformer(dict_pipelines_post_process)),
                ('resample', dict_resamplings['smote'])
            ]),
            'search_space': [
                Real(0.1, 1, name='resample__sampling_strategy'),
                Categorical(['option_1', 'option_2', 'option_3'], name='post_process__option'),
                Integer(5, 15, name='model__max_depth'),
                Real(0.05, 0.31, prior='log-uniform', name='model__learning_rate'),
                Integer(1, 10, name='model__min_child_weight'),
                Real(0.8, 1, prior='log-uniform', name='model__subsample'),
                Real(0.13, 0.8, prior='log-uniform', name='model__colsample_bytree'),
                Real(0.1, 10, prior='log-uniform', name='model__scale_pos_weight'),
                Categorical(['binary:logistic'], name='model__objective')
            ]
        },
        'lightgbm': {
            'model': lgb.LGBMClassifier(),
            'pipeline_post_process': Pipeline([("identity", MyIdentity())]),
            'search_space': [
                Real(0.01, 0.5, prior='log-uniform', name='model__learning_rate'),
                Integer(1, 30, name='model__max_depth'),
                Integer(10, 400, name='model__num_leaves'),
                Real(0.1, 1.0, prior='uniform', name='model__feature_fraction'),
                Real(0.1, 1.0, prior='uniform', name='model__subsample'),
                Categorical(['balanced'], name='model__class_weight'),
                Categorical(['binary'], name='model__objective')
            ]
        }
    }

    document = Document()
    document.add_heading('Report of training', 0)

    best_model = find_best_calibrated_binary_model(X=X, y=y, dict_model_params=dict_models, report_doc=document,
                                                   verbose=True, k_inner_fold=2, k_outer_fold=2,
                                                   n_initial_points=10, n_calls=10, metric='histogram_width',
                                                   peeking_metrics=['roc_auc', 'log_loss', 'pr_auc', 'brier_loss'])
    document.save('report.docx')