Improvement Tips for a LightGBM Model Focusing on Probability Prediction

auc · boosting · classification · machine-learning · predictive-models

I am building a binary classifier using LightGBM. The goal is not to predict the outcome as such, but rather to predict the probability of the target event. More specifically, it is about ranking different objects by their probability of the target event.

The dataset is imbalanced: the class distribution is roughly 1 to 10. The data is not severely imbalanced, but this definitely has an impact on the model's performance.

Given that probabilities are key for this task, I assumed that targeting the AUC score is more beneficial here, especially since it is somewhat immune to uneven class distributions.

I have a feeling that I didn't do a great job in feature engineering (I realize the importance of this part here), but let's assume for a moment that this is the dataset that I need to work with and all the feature engineering tricks have already been implemented.

Honestly speaking, I take it for granted that boosting-based models do not require much data wrangling: for instance, label encoding is usually enough, and it can even outperform the computationally expensive one-hot encoding.
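
For context, here is a minimal sketch of what I mean by skipping one-hot encoding: LightGBM consumes pandas `category` columns natively. The column names and the randomly generated DataFrame are purely illustrative, not my actual data.

```
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier

# Illustrative toy data; in practice this would be the real feature table.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "colour": rng.choice(["red", "green", "blue"], size=200),
    "num_feature": rng.normal(size=200),
    "target": rng.integers(0, 2, size=200),
})

# Keep the categorical column as "category" dtype instead of one-hot encoding it.
df["colour"] = df["colour"].astype("category")

# categorical_feature='auto' (the default) picks up pandas "category" columns.
model = LGBMClassifier(n_estimators=50)
model.fit(df[["colour", "num_feature"]], df["target"])
```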

With all that said, the results I get are far from perfect. An AUC score of 0.82 makes me think the model is not awful at probability prediction, but the other metrics, as you can see, are satisfactory at best.

```
F1-score: 0.508
ROC AUC Score: 0.817
Cohen Kappa Score: 0.356
```

Analyzing the precision/recall curve and finding the threshold that brings their ratio to $\approx 1$ yields a more balanced picture, but for this task it is not yet clear which type of error should be minimized, or whether, say, F1-score maximization is the right target.
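
For reference, this is roughly how I locate that threshold from the precision/recall curve; `y_val` and `proba_val` are placeholders for my validation labels and the predicted probabilities of the positive class.

```
import numpy as np
from sklearn.metrics import precision_recall_curve

# y_val: true labels of the validation set; proba_val: predicted P(target event).
precision, recall, thresholds = precision_recall_curve(y_val, proba_val)

# precision/recall have one more element than thresholds, so drop the last point
# and pick the threshold where the two curves are closest to each other.
gap = np.abs(precision[:-1] - recall[:-1])
best_threshold = thresholds[np.argmin(gap)]
print(f"threshold with precision ≈ recall: {best_threshold:.2f}")
```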

In any case, all the conventional metrics depend on the chosen threshold, so it is not clear whether I can simply skip threshold tuning.

My questions:

  1. Would it be correct to state that having a reasonably high AUC for such tasks can be prioritized as opposed to just looking at precision, recall and other metrics that are functions of thresholds?

  2. I use a combination of Optuna and 5-fold cross-validation to select the best hyperparameters (a stripped-down sketch of this setup follows this list). The results, however, do not improve significantly. I cannot even get a very high AUC score on the training set, regardless of the number of estimators used for LGBMClassifier.
    Does this mean some kind of plateau has been reached for this task, dataset, and features?
    What are some common methods (in addition to better feature engineering and getting more data) to improve gradient boosting results?
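
A stripped-down sketch of my Optuna + 5-fold CV setup; the searched parameters and their ranges are only illustrative, and `X`, `y` stand in for my training features and labels.

```
import optuna
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def objective(trial):
    # Illustrative search space, not a recommendation.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
    model = LGBMClassifier(**params)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    # X, y are placeholders for the training data.
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
```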

```
             precision    recall  f1-score   support

       False       0.92      0.76      0.83     10902
        True       0.40      0.70      0.51      2482

    accuracy                           0.75     13384
   macro avg       0.66      0.73      0.67     13384
weighted avg       0.82      0.75      0.77     13384

Results for threshold=0.66:
              precision    recall  f1-score   support

       False       0.89      0.89      0.89     10902
        True       0.52      0.51      0.51      2482

    accuracy                           0.82     13384
   macro avg       0.70      0.70      0.70     13384
weighted avg       0.82      0.82      0.82     13384

F1-score: 0.515
ROC AUC Score: 0.817
Cohen Kappa Score: 0.405
```

Best Answer

Using the binary log-loss as the classification objective is a good move in this situation (and in most situations). If we care about how far off the probabilities might be, we may want to point Optuna (or whichever hyper-parameter search framework we use) at minimising the Brier score of the predictions; ROC AUC is a ranking score, so it is better than the F1-score for this task, but not necessarily our best bet.
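
As a rough sketch of what that swap looks like, scikit-learn ships a built-in Brier-score scorer that can be plugged into cross-validation (and hence into an Optuna objective); `X` and `y` below are placeholders for your training data.

```
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

model = LGBMClassifier()

# "neg_brier_score" is the built-in scorer (higher is better, hence the negation);
# minimising the Brier score means maximising this value in the search.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_brier_score")
print("mean Brier score:", -scores.mean())
```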

Regarding the particular questions in the main post:

  1. Yes, but we can potentially do better (as discussed above). Metrics based on discontinuous rules such as precision, recall, F1, etc. can be misleading. The post Is accuracy an improper scoring rule in a binary classification setting? focuses on accuracy, but the same reasoning applies to metrics like precision, etc.
  2. Try different hyper-parameters as well as different learners; LightGBM is awesome but not a panacea. Even simply trying XGBoost and CatBoost might be enough to pick up some obvious easy wins (a quick comparison sketch follows this list).
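
Such a comparison could look roughly like this, with default-ish settings; it is only meant to show the mechanics, not a tuned benchmark, and `X`, `y` are again placeholders for your data.

```
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Cross-validated ROC AUC for the three main boosting libraries.
models = {
    "LightGBM": LGBMClassifier(),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "CatBoost": CatBoostClassifier(verbose=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean ROC AUC = {auc:.3f}")
```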

Regarding the sub-question in the comments:

  1. Using isotonic regression can be beneficial, but it has to be set up carefully (hold-out sets, etc.). I do it irrespective of "resampling" if I have time, but it usually gives me little gain in terms of ROC-/PR-AUC. It might also be worth considering other calibration options, such as Platt scaling and beta calibration; I have not found any one of them to dominate the others in my work, though (see the sketch after this list).
  2. Please see my answer in the CV.SE thread Biased prediction (overestimation) for xgboost; I think it is pertinent to your question. As mentioned there, (early) gradient boosting implementations are (were?) not very well calibrated. With larger datasets and better-designed loss functions this may have been ameliorated to some extent nowadays, but I have not seen any recent papers on it.
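
For completeness, a minimal calibration sketch using scikit-learn's CalibratedClassifierCV, whose internal cross-validation keeps the calibration data separate from the data used to fit each underlying model; isotonic regression is shown, and Platt scaling is available via `method="sigmoid"`. `X` and `y` are placeholders for your training data.

```
from sklearn.calibration import CalibratedClassifierCV
from lightgbm import LGBMClassifier

base = LGBMClassifier()

# method="isotonic" for isotonic regression, method="sigmoid" for Platt scaling;
# cv=5 fits 5 boosters and calibrates each on its held-out fold.
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X, y)

# Calibrated probabilities for the positive class.
proba = calibrated.predict_proba(X)[:, 1]
```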