Python – How to Fix Getting Nan Scores from RandomizedSearchCV with Random Forest Classifier

hyperparameter, python, random forest, scikit-learn

I am trying to tune hyperparameters for a random forest classifier using sklearn's RandomizedSearchCV with 3-fold cross-validation. In the end, 253 of the 1000 mean test scores are nan (as found via rf_rnd.cv_results_['mean_test_score']). Any thoughts on what could be causing these failed fits? Thanks.

The answer to a similar question about xgboost indicated that this can occur with an invalid set of hyperparameters. Are there rules for the random forest parameters that I'm not seeing?

My training data is small, at only 75 samples, and thus each fold here would contain 25 samples. Are there combinations of these hyperparameters that would fail for a small data set?

Here is the parameter grid I'm using:

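# imports needed by the snippet below
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# hyperparameter grid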
paramGrid_rf = { "n_estimators":[ int(x) for x in np.linspace( 200, 2000, 10)],
                 "max_features": [ None, 0.5, "sqrt", "log2"],
                 "max_depth": [ int(x) for x in np.linspace( 10, 110, 11)]+[None],
                 "min_samples_split": [ 1, 2, 5, 10],
                 "min_samples_leaf": [1, 2, 4], 
                 "bootstrap": [ True, False]}

# Initialize the RF classifier
rf = RandomForestClassifier()

# initialize the random 3-fold CV search of hyperparameters
rf_rnd = RandomizedSearchCV( estimator=rf, param_distributions=paramGrid_rf, n_iter=1000, cv=3, verbose=1, random_state=42)

# fit to the training data
rf_rnd.fit( X, y)

The first five sets of parameters that gave a nan score are:

{'n_estimators': 1200,
 'min_samples_split': 1,
 'min_samples_leaf': 1,
 'max_features': 'log2',
 'max_depth': 90,
 'bootstrap': True}
{'n_estimators': 200,
 'min_samples_split': 1,
 'min_samples_leaf': 1,
 'max_features': None,
 'max_depth': 80,
 'bootstrap': False}
{'n_estimators': 400,
 'min_samples_split': 1,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 60,
 'bootstrap': False}
{'n_estimators': 1000,
 'min_samples_split': 1,
 'min_samples_leaf': 2,
 'max_features': None,
 'max_depth': 60,
 'bootstrap': False}
{'n_estimators': 1400,
 'min_samples_split': 1,
 'min_samples_leaf': 1,
 'max_features': 0.5,
 'max_depth': 50,
 'bootstrap': True}

Best Answer

The nan scores were caused by including 1 as an option for min_samples_split. The documentation does not explicitly state that this parameter cannot be 1, but the restriction makes sense once you consider what the parameter means: a node that holds only 1 sample cannot be split into subgroups. This also explains why roughly 25% of the sampled parameter sets (253 of 1000) gave a nan score: min_samples_split had four options, and one of them made the whole parameter set invalid.
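
To check this in isolation, here is a minimal sketch that fits a small forest once for each min_samples_split option from the grid and reports whether the fit raised. The data is a synthetic stand-in (75 samples, 4 features and 2 classes are my assumptions, not the asker's data), and fit_error is a hypothetical helper name. The exception message differs between scikit-learn versions, and some releases may handle the integer 1 differently, so the sketch only reports what happens rather than assuming an outcome.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 75-sample training set
# (assumption: 4 features, 2 classes; not the asker's data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(75, 4))
y_demo = rng.integers(0, 2, size=75)


def fit_error(min_samples_split):
    """Fit a small forest; return the error message if fitting raises, else None."""
    clf = RandomForestClassifier(
        n_estimators=10,
        min_samples_split=min_samples_split,
        random_state=0,
    )
    try:
        clf.fit(X_demo, y_demo)
    except ValueError as exc:  # invalid-parameter errors derive from ValueError
        return str(exc)
    return None


# Try every min_samples_split option from the question's grid
for value in [1, 2, 5, 10]:
    message = fit_error(value)
    print(value, "->", "ok" if message is None else "failed: " + message)
```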

The root cause was found by following @BenReiniger's advice to pass error_score="raise" to RandomizedSearchCV, so thanks to him.
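
For anyone hitting the same thing, a rough sketch of that debugging step follows. It uses synthetic data and a much smaller grid than the question's (both are assumptions, to keep it quick), and run_search is a hypothetical helper name. With error_score="raise", a failing fit propagates its exception instead of being recorded as nan, so the offending parameter shows up in the error message. The second call drops 1 from min_samples_split and checks that no nan scores remain.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: 75 samples, 4 features, 2 classes)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(75, 4))
y_demo = rng.integers(0, 2, size=75)


def run_search(min_samples_split_options):
    """Run a small randomized search that raises on the first failed fit."""
    grid = {
        "n_estimators": [10, 20],
        "min_samples_split": min_samples_split_options,
        "min_samples_leaf": [1, 2, 4],
    }
    search = RandomizedSearchCV(
        estimator=RandomForestClassifier(random_state=0),
        param_distributions=grid,
        n_iter=6,
        cv=3,
        random_state=42,
        error_score="raise",  # propagate fit errors instead of storing nan
    )
    search.fit(X_demo, y_demo)
    return search


# Original style of grid: includes 1, which may make some fits fail
try:
    run_search([1, 2, 5, 10])
    print("no fit failed with this scikit-learn version")
except ValueError as exc:
    print("a fit failed:", exc)

# Corrected grid: integer values start at 2
fixed = run_search([2, 5, 10])
n_nan = int(np.isnan(fixed.cv_results_["mean_test_score"]).sum())
print("nan scores with corrected grid:", n_nan)
```

Once the failing parameter is identified, error_score can go back to its default for the full 1000-iteration search.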
