Solved – Controlling over-fitting in local cross-validation LightGBM

Tags: boosting, cross-validation, overfitting

I am training a LightGBM model on a binary classification problem (~20% positive events) with the parameters below:

from lightgbm import LGBMClassifier

clf = LGBMClassifier(
    boosting_type='gbdt',
    objective='binary',
    metric='auc',
    learning_rate=0.03,
    max_depth=3,
    num_leaves=10,
    feature_fraction=0.7,
    bagging_fraction=1,
    bagging_freq=20,
    n_jobs=5,
    n_estimators=5000)

I am using an early-stopping criterion of 200 rounds. Here is what the evaluation output looks like:

Training until validation scores don't improve for 200 rounds.
[100]   training's auc: 0.711959    valid_1's auc: 0.704253
[200]   training's auc: 0.723237    valid_1's auc: 0.710111
[300]   training's auc: 0.729497    valid_1's auc: 0.711863
[400]   training's auc: 0.734452    valid_1's auc: 0.712717
[500]   training's auc: 0.738592    valid_1's auc: 0.713287
[600]   training's auc: 0.742438    valid_1's auc: 0.713539
[700]   training's auc: 0.746025    valid_1's auc: 0.71387
[800]   training's auc: 0.749565    valid_1's auc: 0.714064
[900]   training's auc: 0.75316 valid_1's auc: 0.714063
[1000]  training's auc: 0.756591    valid_1's auc: 0.713949
Early stopping, best iteration is:
[828]   training's auc: 0.750595    valid_1's auc: 0.714133

As one can notice, there is a huge gap between the training and validation AUC. How can I minimise this gap?

I have tried changing the learning rate, depth, feature fraction, bagging frequency, number of leaves, minimum samples per leaf, and L1 & L2 regularization.

I am using 5-fold cross-validation on a training sample of ~100k records with about 120 features.

What else can I try to minimise this gap? Are there any other factors I should be looking at? And how can I identify the set of variables that might be causing the overfitting?

Best Answer

As you've pointed out in your title, you're overfitting on the training set and your model is suffering from high variance. In general, you can address variance by:

  1. Constraining the model
  2. Using more training data
  3. Resolving data quality issues and removing outliers

A clear starting point for variance reduction (and the easiest thing to try), given the model's hyperparameters, is to constrain n_estimators. 5000 estimators is 50x the default value of 100 (see 1). You could train models with values of e.g. 10, 20, 50, 100, 200 and 500, as alluded to by "doing proper grid search" in the first comment on your post, and observe how your metric changes in response to an increasing number of estimators.
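
As a rough sketch of that sweep (not a prescribed procedure), you could score each candidate value of n_estimators with cross-validated AUC. The make_classification call below only fabricates stand-in data with roughly the question's shape and class imbalance; substitute your own feature matrix and target.

from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data: ~100k rows, 120 features, ~20% positives (replace with your own X, y)
X, y = make_classification(n_samples=100_000, n_features=120,
                           weights=[0.8, 0.2], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for n in [10, 20, 50, 100, 200, 500]:
    clf = LGBMClassifier(
        boosting_type='gbdt',
        objective='binary',
        learning_rate=0.03,
        max_depth=3,
        num_leaves=10,
        colsample_bytree=0.7,   # sklearn-API alias for feature_fraction
        n_estimators=n,
        n_jobs=5)
    scores = cross_val_score(clf, X, y, cv=cv, scoring='roc_auc')
    print(f"n_estimators={n:4d}  mean CV AUC={scores.mean():.4f} (+/- {scores.std():.4f})")

Plotting or printing the mean cross-validated AUC against n_estimators shows where the validation metric plateaus, which is where additional trees stop paying for themselves.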

It's possible (and, anecdotally, likely) that your model will perform just as well on your validation set with fewer estimators.
