Improvement Tips for a LightGBM Model Focusing on Probability Prediction

auc · boosting · classification · machine-learning · predictive-models

I am building a binary classifier using LightGBM. The goal is not to predict the outcome as such, but rather to predict the probability of the target event. More specifically, it is about ranking different objects by their probability of the target event.

The dataset is imbalanced: the class distribution is roughly 1 to 10. The data is not severely imbalanced, but this definitely has an impact on the model's performance.

Given that probabilities are key for this task, I assumed that targeting the AUC score is more beneficial here, especially since it is somewhat immune to uneven class distributions.

I have a feeling that I didn't do a great job in feature engineering (I realize the importance of this part here), but let's assume for a moment that this is the dataset that I need to work with and all the feature engineering tricks have already been implemented.

Honestly speaking, I take it for granted that boosting-based models do not require much data wrangling: for instance, label encoding is usually enough, and it can even outperform the computationally expensive one-hot encoding.
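
For context, here is a minimal sketch of what I mean by skipping one-hot encoding: LightGBM consumes pandas `category` columns natively. The column names and the randomly generated DataFrame are purely illustrative, not my actual data.

```
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier

# Illustrative toy data; in practice this would be the real feature table.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "colour": rng.choice(["red", "green", "blue"], size=200),
    "num_feature": rng.normal(size=200),
    "target": rng.integers(0, 2, size=200),
})

# Keep the categorical column as "category" dtype instead of one-hot encoding it.
df["colour"] = df["colour"].astype("category")

# categorical_feature='auto' (the default) picks up pandas "category" columns.
model = LGBMClassifier(n_estimators=50)
model.fit(df[["colour", "num_feature"]], df["target"])
```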

With all that said, the results I get are far from perfect. An AUC score of 0.82 makes me think the model is not awful at probability prediction, but the other metrics, as you can see, are satisfactory at best.

```
F1-score: 0.508
ROC AUC Score: 0.817
Cohen Kappa Score: 0.356
```

Analyzing the precision/recall curve and finding the threshold that brings their ratio to $\approx 1$ yields a more balanced picture, but for this task it is not yet clear which type of error should be minimized, or whether, say, F1-score maximization is the right target.
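
For reference, this is roughly how I locate that threshold from the precision/recall curve; `y_val` and `proba_val` are placeholders for my validation labels and the predicted probabilities of the positive class.

```
import numpy as np
from sklearn.metrics import precision_recall_curve

# y_val: true labels of the validation set; proba_val: predicted P(target event).
precision, recall, thresholds = precision_recall_curve(y_val, proba_val)

# precision/recall have one more element than thresholds, so drop the last point
# and pick the threshold where the two curves are closest to each other.
gap = np.abs(precision[:-1] - recall[:-1])
best_threshold = thresholds[np.argmin(gap)]
print(f"threshold with precision ≈ recall: {best_threshold:.2f}")
```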

In any case, all the conventional metrics depend on the chosen threshold, so it is not clear whether I can simply skip threshold tuning.

My questions:

  1. Would it be correct to state that having a reasonably high AUC for such tasks can be prioritized as opposed to just looking at precision, recall and other metrics that are functions of thresholds?

  2. I use a combination of Optuna and 5-fold cross-validation to select the best hyperparameters (a stripped-down sketch of this setup follows this list). The results, however, do not improve significantly. I cannot even get a very high AUC score on the training set, regardless of the number of estimators used for LGBMClassifier.
    Does this mean some kind of plateau has been reached for this task, dataset, and features?
    What are some common methods (in addition to better feature engineering and getting more data) to improve gradient boosting results?
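
A stripped-down sketch of my Optuna + 5-fold CV setup; the searched parameters and their ranges are only illustrative, and `X`, `y` stand in for my training features and labels.

```
import optuna
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def objective(trial):
    # Illustrative search space, not a recommendation.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
    model = LGBMClassifier(**params)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    # X, y are placeholders for the training data.
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
```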

```
             precision    recall  f1-score   support

       False       0.92      0.76      0.83     10902
        True       0.40      0.70      0.51      2482

    accuracy                           0.75     13384
   macro avg       0.66      0.73      0.67     13384
weighted avg       0.82      0.75      0.77     13384

Results for threshold=0.66:
              precision    recall  f1-score   support

       False       0.89      0.89      0.89     10902
        True       0.52      0.51      0.51      2482

    accuracy                           0.82     13384
   macro avg       0.70      0.70      0.70     13384
weighted avg       0.82      0.82      0.82     13384

F1-score: 0.515
ROC AUC Score: 0.817
Cohen Kappa Score: 0.405
```

Best Answer

Using the binary log-loss as the classification objective is a good move in this situation (and in most situations). If we care about how far off the probabilities might be, we may want to point Optuna (or whichever hyper-parameter search framework we use) at minimising the Brier score of the predictions; ROC AUC is a ranking score, so it is better than the F1-score for this task, but not necessarily our best bet.
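
As a rough sketch of what that swap looks like, scikit-learn ships a built-in Brier-score scorer that can be plugged into cross-validation (and hence into an Optuna objective); `X` and `y` below are placeholders for your training data.

```
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

model = LGBMClassifier()

# "neg_brier_score" is the built-in scorer (higher is better, hence the negation);
# minimising the Brier score means maximising this value in the search.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_brier_score")
print("mean Brier score:", -scores.mean())
```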

Regarding the particular questions in the main post:

  1. Yes, but we can potentially do better (as discussed above). Metrics based on discontinuous rules such as precision, recall, F1, etc. can be misleading. The post Is accuracy an improper scoring rule in a binary classification setting? focuses on accuracy, but the same reasoning applies to metrics like precision, etc.
  2. Try different hyper-parameters as well as different learners; LightGBM is awesome but not a panacea. Even simply trying XGBoost and CatBoost might be enough to pick up some obvious easy wins (a quick comparison sketch follows this list).
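
Such a comparison could look roughly like this, with default-ish settings; it is only meant to show the mechanics, not a tuned benchmark, and `X`, `y` are again placeholders for your data.

```
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Cross-validated ROC AUC for the three main boosting libraries.
models = {
    "LightGBM": LGBMClassifier(),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "CatBoost": CatBoostClassifier(verbose=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean ROC AUC = {auc:.3f}")
```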

Regarding the sub-question in the comments:

  1. Using isotonic regression can be beneficial, but it has to be set up carefully (hold-out sets, etc.). I do it irrespective of "resampling" if I have time, but it usually gives me little gain in terms of ROC-/PR-AUC. It might also be worth considering other calibration options, such as Platt scaling and beta calibration; I have not found any one of them to dominate the others in my work, though (see the sketch after this list).
  2. Please see my answer in the CV.SE thread Biased prediction (overestimation) for xgboost; I think it is pertinent to your question. As mentioned there, (early) gradient boosting implementations are (were?) not very well calibrated. With larger datasets and better-designed loss functions this may have been ameliorated to some extent nowadays, but I have not seen any recent papers on it.
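
For completeness, a minimal calibration sketch using scikit-learn's CalibratedClassifierCV, whose internal cross-validation keeps the calibration data separate from the data used to fit each underlying model; isotonic regression is shown, and Platt scaling is available via `method="sigmoid"`. `X` and `y` are placeholders for your training data.

```
from sklearn.calibration import CalibratedClassifierCV
from lightgbm import LGBMClassifier

base = LGBMClassifier()

# method="isotonic" for isotonic regression, method="sigmoid" for Platt scaling;
# cv=5 fits 5 boosters and calibrates each on its held-out fold.
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X, y)

# Calibrated probabilities for the positive class.
proba = calibrated.predict_proba(X)[:, 1]
```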