I am trying to solve a regression problem on the Boston housing dataset with a random forest regressor. I was using GridSearchCV to select the best hyperparameters.
Problem 1
Should I fit GridSearchCV on X_train, y_train and then get the best parameters, or should I fit it on X, y (the entire dataset) to get the best parameters?
Problem 2
Say I fit it on X, y, get the best parameters, and then build a new model with these best parameters. How should I then train this new model: on X_train, y_train, or on X, y?
Problem 3
If I train the new model on X, y, then how will I validate the results?
My code so far
# Feature and target columns from the Boston housing DataFrame (boston_data)
feature_cols = ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
X = boston_data[feature_cols]
y = boston_data['PRICE']
Train Test Split of Data
from sklearn.model_selection import train_test_split   # sklearn.cross_validation is deprecated
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Grid Search to get best hyperparameters
from sklearn.model_selection import GridSearchCV   # sklearn.grid_search is deprecated
from sklearn.ensemble import RandomForestRegressor

RFReg = RandomForestRegressor(random_state=1)
param_grid = {
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth': [4, 5, 6, 7, 8, 9, 10]
}
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv=10)
CV_rfc.fit(X_train, y_train)
CV_rfc.best_params_
#{'max_depth': 10, 'n_estimators': 100}
Train a model with max_depth=10, n_estimators=100
RFReg = RandomForestRegressor(max_depth=10, n_estimators=100, random_state=1)
RFReg.fit(X_train, y_train)
y_pred = RFReg.predict(X_test)          # test-set predictions
y_pred_train = RFReg.predict(X_train)   # training-set predictions
RMSE: 2.8139766730629394
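(For reference, the RMSE above can be computed from the predictions along these lines; the sketch below uses the test-set predictions.)

import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE on the held-out test set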
I just want some guidance on what the correct steps would be.
Best Answer
This does depend a little on what intent you have for X_test, y_test, but I'm going to assume that you set this data aside so you can get an accurate assessment of your final model's generalization ability (which is good practice). In that case, you want to determine your hyperparameters using only the training data, so your parameter-tuning cross-validation should be run using only the training data as the base dataset. If instead you use the entire dataset, then your test data provides some information towards your choice of hyperparameters, and your subsequent estimate of the test error will be overly optimistic.
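For concreteness, here is a minimal sketch of that workflow, reusing your X_train / X_test split and the grid you already defined: the search only ever sees the training data, and the test split is touched exactly once at the very end.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

param_grid = {'n_estimators': [100, 500, 1000, 1500],
              'max_depth': [4, 5, 6, 7, 8, 9, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=1), param_grid, cv=10)
search.fit(X_train, y_train)               # the test data is never seen here

final_model = search.best_estimator_       # refit on all of X_train (refit=True by default)
test_rmse = np.sqrt(mean_squared_error(y_test, final_model.predict(X_test)))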
Additionally, tuning n_estimators in a random forest is a widespread anti-pattern. There's no need to tune that parameter: a larger value always gives a model with the same bias but less variance, so larger is never worse. You really only need to be tuning max_depth here. Here's a reference for that advice.

As for the worry that the chosen hyperparameters end up biased towards the data used to tune them: yup, that's always true and fundamentally unavoidable. You have to use some set of data to tune the values of those parameters, so in the end they have to be biased towards performance on some sample. The best you can do is rigorously use cross-validation and a test set to minimize that bias, and measure your results.
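Concretely, the n_estimators point above means fixing the forest size at a single, comfortably large value and searching only over max_depth; a minimal sketch (the value 1000 below is illustrative, not prescriptive):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf = RandomForestRegressor(n_estimators=1000, random_state=1)
search = GridSearchCV(rf, param_grid={'max_depth': [4, 5, 6, 7, 8, 9, 10]}, cv=10)
search.fit(X_train, y_train)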
The basic procedure is: