Solved – Small dataset and optimal parameters for XGBoost

boosting, optimization, parameterization, small-sample

I am in the process of tuning the feature encodings for my XGBoost model, e.g. ordinal (label) encoding versus one-hot encoding. For example, I run the model with column A one-hot encoded, then run it label encoded, and compare the RMSE. For one of my iterations the model improved by about 5% (RMSE dropping from 3,800 to 3,607). This was strange to me, because most of the iterations I tested would improve or worsen the model by about 1%, so 5% was an outlier. I thought it might be due to my small sample size, and thus that there may be some overfitting.
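For context, each iteration essentially just swaps the encoder inside my preprocessor, roughly like this (simplified sketch; the column name is a placeholder, not one of my real features):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# iteration 1: column A one-hot encoded
preprocessor = ColumnTransformer(
    [('A', OneHotEncoder(handle_unknown='ignore'), ['A'])],
    remainder='passthrough')

# iteration 2: the same column ordinal/label encoded instead
# preprocessor = ColumnTransformer(
#     [('A', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['A'])],
#     remainder='passthrough')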

I have a sample size of 648 observations. I split those into an 80/20 train and test set, and then perform 5-fold cross-validation over a grid search of parameters on the training set. For those who do not want to do the math, that is 518/130: 518 observations used for training and 130 held out for testing. The 5-fold cross-validation then splits the 518 roughly 80/20 again in each fold, i.e. about 414 rows to fit and 104 to validate. My parameter grid and the best parameters found are included below.
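In code, the split looks roughly like this (X and y being my features and target; the random state is arbitrary):

from sklearn.model_selection import train_test_split

# 648 rows -> 80/20 hold-out: 518 rows to train, 130 rows to test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13)
# the 5-fold CV inside the grid search then fits on ~414 of the 518 rows
# and validates on the remaining ~104 rows in each fold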

What should I do to make sure there is as little overfitting as possible? Should I increase k for the cross-validation? Try to adjust gamma or max_delta_step (which I do not fully understand)? I know that is open-ended, but which parameters are the most important to tune against overfitting on small datasets?

from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# preprocessor is the ColumnTransformer with the encoders described above
model = XGBRegressor(booster='gbtree', random_state=13)
mymodel = Pipeline(steps=[('preprocessor', preprocessor),
                          ('model', model)
                          ])

# Set up parameters for CV and grid search
param_grid = dict(
                  model__gamma = [0],
                  model__n_estimators = [100, 500, 1000],
                  model__max_delta_step = [0],
                  model__max_depth = [5, 6, 7],
                  model__learning_rate = [0.1, 0.3, 0.5, 1],
                  model__min_child_weight = [1, 3, 5],
                  model__colsample_bytree = [0.8, 1],
                  model__early_stopping_rounds = [42],  # note: recent XGBoost versions also need an eval_set passed to fit() for this
                  model__num_parallel_tree = [1, 3]
                  )
# scoring methods
sm = ['neg_mean_squared_error']
gs = GridSearchCV(mymodel
                  ,param_grid = param_grid
                  ,scoring = sm[0]
                  ,n_jobs = -1
                  ,cv = 5
                  ,refit = sm[0]
                  )
# best parameters found by the grid search (gs.best_params_)
{'model__colsample_bytree': 0.8,
 'model__early_stopping_rounds': 42,
 'model__gamma': 0,
 'model__learning_rate': 0.1,
 'model__max_delta_step': 0,
 'model__max_depth': 5,
 'model__min_child_weight': 5,
 'model__n_estimators': 100,
 'model__num_parallel_tree': 1}
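
For reference, the search is then fit on the training split and the refit best model is scored on the held-out 20%, roughly like this (simplified sketch, using X_train/X_test from the split above):

import numpy as np
from sklearn.metrics import mean_squared_error

gs.fit(X_train, y_train)                            # 5-fold CV over the 518 training rows
print(gs.best_params_)                              # the dictionary shown above
pred = gs.predict(X_test)                           # best model, scored on the 130 held-out rows
print(np.sqrt(mean_squared_error(y_test, pred)))    # test RMSE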

Best Answer

It is difficult to give a definitive answer without having the data and all required subject knowledge at hand. Still, I can offer some comments.

  1. Your validation strategy is fine as long as you decide everything by cross-validation and not by the test data (but see point 5).

  2. If your best solution picks a value at the border of the grid, the grid is usually not well chosen. This happens for several of your parameters.

  3. For such a small dataset, a tree depth of 5 or more seems too high.

  4. XGBoost has important additional regularization parameters, such as the L1 and L2 penalties (reg_alpha and reg_lambda). Usually these need to be tuned as well; see the sketch after this list.

  5. Are the rows really independent, or are there clusters of rows that would invalidate your validation strategy? (See the GroupKFold sketch below.)
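
To make points 2 to 4 concrete, a revised grid might look like the sketch below (the values are illustrative, not recommendations tuned to your data): it drops the parameters you keep at a single value, uses shallower trees, extends the learning-rate range downward since the smallest value won, and adds the L1/L2 penalties, so that no winner sits on a grid border.

param_grid = dict(
    model__n_estimators     = [100, 300, 500],
    model__max_depth        = [2, 3, 4],          # shallower trees for ~650 rows
    model__learning_rate    = [0.01, 0.05, 0.1],  # 0.1 won before, so extend downward
    model__min_child_weight = [1, 5, 10],
    model__subsample        = [0.6, 0.8, 1.0],
    model__colsample_bytree = [0.6, 0.8, 1.0],
    model__reg_alpha        = [0, 0.1, 1, 10],    # L1 penalty
    model__reg_lambda       = [0.1, 1, 10],       # L2 penalty
)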
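
And for point 5: if there are such clusters (for example, repeated measurements of the same unit), one option is to keep every group entirely inside one fold, e.g. with GroupKFold. A minimal sketch, assuming a hypothetical groups_train array with one group id per training row:

from sklearn.model_selection import GroupKFold, GridSearchCV

gkf = GroupKFold(n_splits=5)
gs = GridSearchCV(mymodel,
                  param_grid=param_grid,
                  scoring='neg_mean_squared_error',
                  cv=gkf,
                  n_jobs=-1)
# rows of the same group never end up in both the fit and validation parts of a fold
gs.fit(X_train, y_train, groups=groups_train)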
