Solved – Fitting sklearn GridSearchCV model

Tags: machine-learning, random-forest, regression, scikit-learn

I am trying to solve a regression problem on the Boston dataset using a random forest regressor. I am using GridSearchCV to select the best hyperparameters.

Problem 1

Should I fit the GridSearchCV on some X_train, y_train and then get the best parameters?

OR

Should I fit it on X, y to get the best parameters? (X, y = entire dataset)

Problem 2

Say I fit it on X, y, get the best parameters, and then build a new model with these best parameters.
What data should I train this new model on?

Should I train the new model on X_train, y_train or on X, y?

Problem 3

If I train the new model on X, y, then how will I validate the results?

My code so far

# Dataframes
feature_cols = ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']

X = boston_data[feature_cols]
y = boston_data['PRICE']
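(The loading step isn't shown above, so the following is an assumption: a minimal sketch of how boston_data might have been built, using load_boston, which was available in older scikit-learn versions.)

import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; shown for context only

# Assumption: boston_data was built from the Bunch returned by load_boston
boston = load_boston()
boston_data = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_data['PRICE'] = boston.target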

Train Test Split of Data

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

Grid Search to get best hyperparameters

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in newer scikit-learn

# Base estimator for the grid search (must be defined before it is passed to GridSearchCV)
RFReg = RandomForestRegressor(random_state=1)
param_grid = { 
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth' : [4,5,6,7,8,9,10]
}

CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv= 10)
CV_rfc.fit(X_train, y_train)

CV_rfc.best_params_ 
#{'max_depth': 10, 'n_estimators': 100}

Train a model with max_depth=10 and n_estimators=100

RFReg = RandomForestRegressor(max_depth = 10, n_estimators = 100, random_state = 1)
RFReg.fit(X_train, y_train)
y_pred = RFReg.predict(X_test)
y_pred_train = RFReg.predict(X_train)

RMSE: 2.8139766730629394
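For reference, an RMSE like the one above can be computed as follows (a sketch; whether the reported value is the train or test error isn't stated, so the test-set evaluation here is an assumption):

import numpy as np
from sklearn.metrics import mean_squared_error

# Assumption: the reported RMSE is the test-set error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)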

I just want some guidance on what the correct steps would be.

Best Answer

This does depend a little on what your intent is for X_test, y_test, but I'm going to assume that you set this data aside so you can get an accurate assessment of your final model's generalization ability (which is good practice).

In that case, you want to determine your hyperparameters using only the training data, so your parameter-tuning cross validation should be run with the training data as the base dataset. If instead you use the entire dataset, then your test data provides some information towards your choice of hyperparameters, and your subsequent estimate of the test error will be overly optimistic.

Additionally, tuning n_estimators in a random forest is a widespread anti-pattern. There's no need to tune that parameter: a larger value gives a model with the same bias but less variance, so larger is never worse. You really only need to tune max_depth here. Here's a reference for that advice.
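Concretely, that advice might look like the sketch below: n_estimators is fixed at a comfortably large value (1000 here, an arbitrary choice) rather than tuned, and only max_depth is searched.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Fix n_estimators at a large value and tune only max_depth
rf = RandomForestRegressor(n_estimators=1000, random_state=1)
param_grid = {'max_depth': [4, 5, 6, 7, 8, 9, 10]}

search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=10)
search.fit(X_train, y_train)
print(search.best_params_)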

But my main concern is that the hyperparameters I get will be biased towards the training dataset

Yup. That's always true and fundamentally unavoidable: you have to use some set of data to tune the values of those parameters, so in the end they will be biased towards performance on some sample. The best you can do is rigorously use cross validation and a test set to minimize that bias and measure your results.

This is what I am not understanding. How will I implement both grid search and cross validation in scikit-learn?

The basic procedure is:

  • Split raw data into train and test.
  • Use cross validation on the split-off training data to estimate the optimal values of the hyperparameters (by minimizing the CV test error).
  • Fit a single model to the entire training data using the determined optimal hyperparameters.
  • Score that model on your original test data to estimate the performance of the final model.
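Put together with the code from the question, a sketch of that procedure might look like this (the grid values and the fixed n_estimators are assumptions; note that GridSearchCV with the default refit=True already refits the best parameter combination on the whole training set):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Split raw data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 2. Cross validation on the training data only, to choose hyperparameters
param_grid = {'max_depth': [4, 5, 6, 7, 8, 9, 10]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=1000, random_state=1),
    param_grid=param_grid,
    cv=10,
)
search.fit(X_train, y_train)

# 3. With refit=True (the default), GridSearchCV has already refit the best
#    parameter combination on the entire training set
final_model = search.best_estimator_

# 4. Score that model on the held-out test data
test_rmse = np.sqrt(mean_squared_error(y_test, final_model.predict(X_test)))
print(search.best_params_, test_rmse)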