Machine Learning – How to Calibrate with Multiclass Classification Problem

calibration, classification, machine-learning, mathematical-statistics, multi-class

I am training a model to predict the label (target), which is a loan status coded as 0, 1, 2 or 3, so I have 4 classes. I have so far trained a model as follows:

from HyperclassifierSearch import HyperclassifierSearch

X = data.iloc[:, :-1]
y = data.label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

# Create a hold-out dataset to train the calibrated model on, to prevent overfitting
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, stratify=y_train, test_size=0.2, random_state=42)

categorical_transformer = OneHotEncoder(handle_unknown='ignore')

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, fill_value=0)),
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, cat_cols)])

# Then I use the HyperclassifierSearch library
models = {
    'xgb': Pipeline(steps=[('preprocessor', preprocessor),
                           ('clf', XGBClassifier(objective='multi:softprob'))]),
    'rf': Pipeline(steps=[('preprocessor', preprocessor),
                          ('clf', RandomForestClassifier(criterion='entropy', random_state=42))]),
}

search = HyperclassifierSearch(models, params)
best_grid = search.train_model(X_train, y_train, cv=3, n_jobs=-1, scoring='accuracy')
results = search.evaluate_model()
fitted_model = best_grid.best_estimator_

pred = fitted_model.predict_proba(X_test)
labels = fitted_model.predict(X_test)

**Note: I have omitted a lot of the imports and the params dict since they are large, and have only included the HyperclassifierSearch import.**

My pred is a matrix with 4 columns, each corresponding to one loan class. I know it is generally good practice to calibrate the probabilities, particularly for tree-based algorithms, where the output is a score rather than a true probability. I am, however, confused as to how to calibrate these probabilities.

Usually I would calibrate using the hold-out validation set, but I am unsure how to do this in the multiclass case.
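For reference, this is roughly the pattern I have in mind (a sketch only, assuming cv='prefit' is available in my scikit-learn version), fitting the calibrator on the hold-out validation split created above:

from sklearn.calibration import CalibratedClassifierCV

# Calibrate the already-fitted pipeline on the hold-out validation split,
# so the calibrator never sees the data the model itself was trained on.
calibrated = CalibratedClassifierCV(fitted_model, method='sigmoid', cv='prefit')
calibrated.fit(X_validation, y_validation)
calibrated_pred = calibrated.predict_proba(X_test)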

Update
Should I amend the above XGBClassifier by doing the following:

OneVsRestClassifier(CalibratedClassifierCV(XGBClassifier(objective='multi:softprob'), cv=10))

https://stackoverflow.com/questions/31617530/multiclass-linear-svm-in-python-that-return-probability

My question is: what is the correct way to calibrate the probabilities from a multiclass model?

Best Answer

I briefly studied this problem a few years ago, and my conclusion at the time was that there is no gold standard yet for calibrating multi-category forecasts.

There are some methods to calibrate the probabilities directly in the simplex space that involve using a Dirichlet distribution, but to the best of my knowledge, most approaches consist of a reduction to binary calibration tasks followed by a coupling of the adjusted probabilities.

From multiple categories to multiple binary tasks

First, there are many ways of decomposing a multi-category problem into binary tasks:

  • "all pairs": e.g. 0 vs 1, 0 vs 2, 0 vs 3, 1 vs 2, 1 vs 3, 2 vs 3
  • "one-against-all": e.g. 0 vs all, 1 vs all, …
  • or something else. For instance, for an ordinal score, it would be interesting to do $y \leq 0$ vs $y > 0$, $y \leq 1$ vs $y > 1$, etc. (i.e. using the cumulative distribution function). In general, any decomposition can be represented by an Error-Correcting Output Codes (ECOC) matrix. (See the sketch after this list for these three decompositions.)
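To make the decompositions concrete, here is a minimal sketch (assuming a 4-class problem like the one in the question, with hypothetical labels) of how the binary targets for the one-against-all, all-pairs and cumulative schemes could be built:

import numpy as np

# Hypothetical labels for a 4-class problem (statuses 0-3, as in the question).
y = np.array([0, 2, 1, 3, 2, 0, 1, 3])

# "one-against-all": one binary target per class k (1 = class k, 0 = everything else)
ovr_targets = {k: (y == k).astype(int) for k in range(4)}

# "all pairs": one binary task per pair (i, j), using only the rows of class i or j
pair_masks = {(i, j): np.isin(y, [i, j]) for i in range(4) for j in range(i + 1, 4)}

# cumulative (ordinal): one binary target per threshold k, predicting y > k
cum_targets = {k: (y > k).astype(int) for k in range(3)}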

Calibrating the binary problems

Then you can calibrate these binary tasks using your preferred method: Platt scaling, isotonic regression, beta calibration, etc.
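As an illustration, here is a minimal sketch of Platt scaling and isotonic regression applied to one such binary task with scikit-learn; the scores and labels are hypothetical stand-ins for the uncalibrated scores of a single binary problem evaluated on a hold-out set:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

# Hypothetical uncalibrated scores for ONE binary task on the hold-out split
# (e.g. the "class k vs rest" column of predict_proba), with the true 0/1 labels.
scores_val = np.array([0.10, 0.40, 0.35, 0.80, 0.95, 0.60])
y_val      = np.array([0,    0,    1,    1,    1,    0   ])

# Platt scaling: fit a one-dimensional logistic regression on the scores.
platt = LogisticRegression().fit(scores_val.reshape(-1, 1), y_val)
calibrated_platt = platt.predict_proba(scores_val.reshape(-1, 1))[:, 1]

# Isotonic regression: fit a monotone, piecewise-constant mapping score -> probability.
iso = IsotonicRegression(out_of_bounds='clip').fit(scores_val, y_val)
calibrated_iso = iso.predict(scores_val)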

Coupling the probabilities

And finally, the calibrated forecasts for each binary task need to be coupled together. This is not trivial because you want to make sure the resulting probabilities lie on the simplex (all non-negative and summing to one) without introducing additional bias into your calibrated binary probabilities.

In the "one-against-all" decomposition, if $p_i$ is the calibrated probability for the binary problem "$i$ vs all", then you can derive a multi-category calibrated forecast as $\hat{p}_i = \frac{p_i}{\sum_j p_j}$. Note this is probably what OneVsRestClassifier is doing, but do check the documentation first.
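A minimal sketch of this coupling step, with hypothetical calibrated one-vs-all probabilities:

import numpy as np

# Hypothetical calibrated one-vs-all probabilities p_i for 4 classes (rows = samples).
# They need not sum to 1 because each binary task was calibrated independently.
p = np.array([[0.70, 0.20, 0.10, 0.05],
              [0.10, 0.55, 0.40, 0.15]])

# Couple them by renormalising each row: p_hat_i = p_i / sum_j p_j
p_hat = p / p.sum(axis=1, keepdims=True)
print(p_hat.sum(axis=1))  # -> [1. 1.]

Incidentally, this renormalisation is also what scikit-learn's CalibratedClassifierCV documents for the multiclass case: each class is calibrated separately in a one-vs-rest fashion and the resulting probabilities are normalised to sum to one, so it is worth checking the documentation of your version.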

If your data is ordinal and you choose binary classification tasks that correspond to predicting the cumulative distribution function, you can use Frank and Hall's method [1] to derive the multi-category calibrated forecasts. Essentially, this consists of noticing that for $y \in \{1, \dots, K\}$, $P(y = 1) = P(y \leq 1)$, $P(y = K) = 1 - P(y \leq K-1)$, and for $1 < k < K$, $P(y = k) = P(y \leq k) - P(y \leq k-1)$. The approach may lead to negative probabilities because the binary tasks are calibrated independently, but it is possible to slightly modify the equations using conditional probabilities to address this problem (see section 2.4 in [2] for example).
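A minimal sketch of this differencing step (here with classes coded 0–3 as in the question, and hypothetical calibrated cumulative probabilities $P(y \leq k)$ for $k = 0, 1, 2$):

import numpy as np

# Hypothetical calibrated cumulative probabilities P(y <= k), k = 0, 1, 2:
# one column per binary task "y <= k vs y > k", one row per sample.
cdf = np.array([[0.20, 0.55, 0.90],
                [0.05, 0.30, 0.60]])

# Frank & Hall differencing:
#   P(y = 0) = P(y <= 0),  P(y = k) = P(y <= k) - P(y <= k-1),  P(y = 3) = 1 - P(y <= 2)
n = cdf.shape[0]
p = np.diff(np.hstack([np.zeros((n, 1)), cdf, np.ones((n, 1))]), axis=1)

# The binary tasks were calibrated independently, so some entries can be negative;
# a crude fix is to clip and renormalise (see section 2.4 of [2] for a better approach).
p = np.clip(p, 0.0, None)
p = p / p.sum(axis=1, keepdims=True)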

Other approaches

In the problem I encountered a few years ago, I ended up realising that I could obtain calibrated probabilities by using a more appropriate model. Usually, I find calibrating probabilities is the answer to the wrong problem. If you have predictions that are not calibrated, more often than not there is a problem with your model, so if you really care about calibration, you may want to consider using a model that is likely to yield calibrated probabilities in the first place.

For instance, you could consider using multinomial logistic regression or ordinal logistic regression (with or without regularisation).
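For example, here is a minimal sketch of the multinomial option, reusing the preprocessor and splits defined in the question (the regularisation strength C is a hypothetical value you would tune):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Multinomial logistic regression: its softmax outputs are probabilities by
# construction, so they are often reasonably well calibrated to begin with.
multinom = Pipeline(steps=[
    ('preprocessor', preprocessor),        # the ColumnTransformer from the question
    ('clf', LogisticRegression(C=1.0, max_iter=1000)),
])
multinom.fit(X_train, y_train)
proba = multinom.predict_proba(X_test)     # shape (n_samples, 4), rows sum to 1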

References

[1] E. Frank and M. Hall, “A simple approach to ordinal classification,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2001, vol. 2167, pp. 145–156.

[2] J. S. Cardoso and J. F. Pinto Da Costa, “Learning to classify ordinal data: The data replication method,” J. Mach. Learn. Res., vol. 8, pp. 1393–1429, 2007.
