I am training a LightGBM model on a binary classification problem (~20% positive class) with the parameters below:
clf = LGBMClassifier(
    boosting_type='gbdt',
    objective='binary',
    metric='auc',
    learning_rate=0.03,
    max_depth=3,
    num_leaves=10,
    feature_fraction=0.7,
    bagging_fraction=1,
    bagging_freq=20,
    n_jobs=5,
    n_estimators=5000,
)
I am using an early stopping criterion of 200 rounds. Here is what the evaluation output looks like:
Training until validation scores don't improve for 200 rounds.
[100] training's auc: 0.711959 valid_1's auc: 0.704253
[200] training's auc: 0.723237 valid_1's auc: 0.710111
[300] training's auc: 0.729497 valid_1's auc: 0.711863
[400] training's auc: 0.734452 valid_1's auc: 0.712717
[500] training's auc: 0.738592 valid_1's auc: 0.713287
[600] training's auc: 0.742438 valid_1's auc: 0.713539
[700] training's auc: 0.746025 valid_1's auc: 0.71387
[800] training's auc: 0.749565 valid_1's auc: 0.714064
[900] training's auc: 0.75316 valid_1's auc: 0.714063
[1000] training's auc: 0.756591 valid_1's auc: 0.713949
Early stopping, best iteration is:
[828] training's auc: 0.750595 valid_1's auc: 0.714133
As one can notice, there is a large gap between the training and validation AUC. How can I minimise this gap?
I have tried changing the learning rate, depth, feature fraction, bagging frequency, number of leaves, minimum samples per leaf, and L1 & L2 regularization.
I am using 5-fold cross-validation on a training sample of 100k records with about 120 features.
What else can I try to minimise this gap? Are there other factors I should be looking at? How can I identify the set of variables that might be causing the overfitting?
Best Answer
As you've pointed out in your title, you're overfitting on the training set and your model is suffering from high variance. In general, variance can be reduced in a few standard ways.
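Typical levers are fewer and shallower trees, stronger leaf constraints, and explicit L1/L2 penalties. As one illustrative (deliberately untuned) example of a more constrained parameter set, note also that `bagging_fraction=1` disables bagging entirely in LightGBM, so your `bagging_freq=20` currently has no effect:

```python
# Illustrative, untuned values: compared to the original settings, this uses
# fewer trees, stronger leaf constraints, and explicit L1/L2 penalties.
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.03,
    'max_depth': 3,
    'num_leaves': 8,           # keep num_leaves <= 2**max_depth
    'min_child_samples': 100,  # require more samples per leaf
    'lambda_l1': 1.0,          # L1 regularization
    'lambda_l2': 1.0,          # L2 regularization
    'bagging_fraction': 0.8,   # bagging_fraction=1 disables bagging,
    'bagging_freq': 1,         # regardless of bagging_freq
    'feature_fraction': 0.7,
    'n_estimators': 500,
}
```

Which of these actually helps depends on your data; the point is to tighten one lever at a time and watch the train/validation gap.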
A clear starting point for variance reduction, and the easiest thing to try given the model's hyperparameters, is to constrain `n_estimators`. 5000 estimators is 50x the default value of 100 (see 1). You could train models with values of e.g. 10, 20, 50, 100, 200, 500, as alluded to by "doing proper grid search" in the first comment on your post, and observe how your metric changes in response to an increasing number of estimators. It's possible (and anecdotally, likely) that your model will perform as well on your validation set with far fewer estimators.