In my model testing, I tried model ensembling (blending, in this case) to get better results. However, the ensemble cannot beat a single RandomForestClassifier.
In the first layer, I train meta-estimators to create train and test prediction sets.
Then, in the second layer, I train another set of models on those prediction sets and combine their predictions into a single one.
I think there are two things that worsen the prediction:
- When combining predictions in the 1st layer for blend_test, I am using the mean.
- Same with the 2nd layer: when combining predictions for b_pred_m, I am using the mean too. I tried a weighted average, but it did not help.
I would appreciate any tips on how to improve my ensemble model.
def run():
    n_folds =
    clfs = [RandomForestClassifier(n_estimators = 200, class_weight={0:1,1:2}),
            ExtraTreesClassifier(n_estimators = 200, criterion = 'gini'),
            GradientBoostingClassifier(n_estimators = 200),
            KNeighborsClassifier(),
            LogisticRegression(penalty='l1',class_weight={0:1,1:2}),
            GaussianNB()]
    clfs_name = ['RandomForestClassifier', 'ExtraTreesClassifier',
                 'GradientBoostingClassifier', 'KNeighborsClassifier',
                 'LogisticRegression', 'GaussianNB']
    # Ready for cross validation
    skf = list(StratifiedKFold(y_train, n_folds))
    # Pre-allocate the data
    blend_train = np.zeros((X_train.shape[0], len(clfs))) # Number of training data x Number of classifiers
    blend_test = np.zeros((X_test.shape[0], len(clfs))) # Number of testing data x Number of classifiers
    for j, clf in enumerate(clfs):
        print 'Training classifier: %s' % clfs_name[j]
        # Number of testing data x Number of folds,
        # we will take the mean of the predictions later
        blend_test_j = np.zeros((X_test.shape[0], len(skf)))
        for i, (train_index, cv_index) in enumerate(skf):
            print 'Fold [%s]' % (i)
            X_train_f = X_train[train_index] # output from PCA, array returned
            y_train_f = y_train.iloc[train_index] # iloc if using pandas
            X_cv = X_train[cv_index]
            y_cv = y_train.iloc[cv_index]
            # Train on train part of fold
            clf.fit(X_train_f, y_train_f)
            # Predict on cv_index part of fold
            blend_train[cv_index, j] = clf.predict(X_cv)
            # N-folds, so I will have n-predictions on X_test
            blend_test_j[:,i] = clf.predict(X_test)
        # Take mean of n-predictions
        blend_test[:,j] = blend_test_j.mean(1)
    # Blending, use different classifier
    b_clfs = [svm.SVC(C=1.0, kernel='linear'),
              RandomForestClassifier(n_estimators = 100),
              KNeighborsClassifier()]
    b_clfs_name = ['SVC','RandomForestClassifier','KNeighborsClassifier']
    b_pred = np.zeros((blend_test.shape[0], len(b_clfs)))
    for y, b_clf in enumerate(b_clfs):
        print 'Blending classifier: %s' % b_clfs_name[y]
        b_clf.fit(blend_train, y_train)
        b_pred[:,y] = b_clf.predict(blend_test)
        print 'Accuracy = %s' % (metrics.accuracy_score(y_test, b_pred[:,y]))
        print 'Precision = %s' % (metrics.precision_score(y_test, b_pred[:,y]))
        print 'Recall = %s' % (metrics.recall_score(y_test, b_pred[:,y]))
    # Final Prediction
    print '---- Final score ----'
    #b_pred_m = np.round_(np.mean(b_pred, axis=1)) #mean for each row
    b_pred_m = np.round_(np.ma.average(b_pred, axis=1,)) #weights=[0,0,0] #mean for each row
    print 'Accuracy = %s' % (metrics.accuracy_score(y_test, b_pred_m))
    print 'Precision = %s' % (metrics.precision_score(y_test, b_pred_m))
    print 'Recall = %s' % (metrics.recall_score(y_test, b_pred_m))
Best Answer
There are several approaches that you might consider taking.
Firstly, it might be worth weighting the contribution of each model to the final ensemble by some metric based on how well it performs on your validation set.
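As a rough illustration of the weighting idea, here is a minimal sketch in plain Python. It assumes each model outputs a positive-class probability (for scikit-learn classifiers that would typically come from `predict_proba` rather than `predict`) and that you already have one validation score per model. The function name `weighted_blend` and all the numbers are hypothetical, chosen only to show the mechanics. With NumPy, `np.average(..., weights=...)` does the same job.

```python
def weighted_blend(probas, val_scores):
    """Blend per-model probabilities with weights proportional to validation scores.

    probas[m][i] is model m's predicted P(class 1) for sample i.
    val_scores[m] is model m's score on a held-out validation set.
    """
    total = float(sum(val_scores))
    weights = [s / total for s in val_scores]  # normalise so weights sum to 1
    n_samples = len(probas[0])
    return [sum(w * p[i] for w, p in zip(weights, probas))
            for i in range(n_samples)]


# Hypothetical numbers, for illustration only.
probas = [
    [0.9, 0.2, 0.6],  # model A
    [0.7, 0.4, 0.3],  # model B
]
val_scores = [0.75, 0.25]  # model A did better on validation, so it gets more weight

blended = weighted_blend(probas, val_scores)
labels = [1 if p >= 0.5 else 0 for p in blended]  # threshold the blended probability
```

Blending probabilities and thresholding once at the end keeps more information than averaging hard 0/1 labels, which may be one reason a weighted mean of labels does not change much.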
Secondly, it might be a good idea to look at the output of each model and calculate how well correlated the outputs are. If the results are very highly correlated, it might be better to drop some of those models from the ensemble and instead combine the least correlated results (the hypothesis being that the correct guesses should be correlated and the errors uncorrelated).
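One way to check this is to compute the pairwise Pearson correlation between the models' predictions on the same validation rows. The sketch below is plain Python; the helper names, the 0.95 cut-off and the prediction vectors are all hypothetical. With NumPy, `np.corrcoef` on the columns of the prediction matrix gives the same numbers.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length prediction vectors."""
    n = float(len(a))
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant predictions: correlation is undefined, treated as 0 here
    return cov / sqrt(var_a * var_b)


def pairwise_correlations(preds):
    """Map each pair of model names to the correlation of their predictions."""
    return {(m, n): pearson(preds[m], preds[n])
            for m, n in combinations(sorted(preds), 2)}


# Hypothetical validation-set predictions, for illustration only.
preds = {
    'rf':  [1, 0, 1, 1, 0, 0],
    'et':  [1, 0, 1, 1, 0, 0],  # identical to 'rf', so it adds nothing new
    'knn': [1, 1, 0, 1, 0, 0],
}

corr = pairwise_correlations(preds)
# Pairs above an (assumed) threshold are candidates for dropping one member.
redundant = [pair for pair, r in corr.items() if r > 0.95]
```

Here `redundant` flags the `('et', 'rf')` pair, so one of the two could be dropped while keeping the less correlated `knn` predictions.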
You may have seen it already, but there is some excellent advice about this in the context of a Kaggle competition here.