Solved – Ensemble models perform worse than a single one

ensemble-learning, python, scikit-learn

In my model testing, I tried to use model ensembling (blending, in this case) to get better results. However, the ensemble cannot beat a single RandomForestClassifier.

In the first layer, I train a set of base estimators to create train and test prediction sets.

Then, in the second layer, I train another set of models on those prediction sets and combine their predictions into a single one.

I think there are two things here that worsen the prediction:

  • When combining predictions in the 1st layer for blend_test, I am using the mean.

  • The same goes for the 2nd layer: when combining predictions for b_pred_m, I am using the mean as well. I tried a weighted average, but it did not help.

I would appreciate any tips on how to improve my ensemble model.

import numpy as np
# imports for the scikit-learn (<0.18) API used below
from sklearn.cross_validation import StratifiedKFold
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn import svm, metrics

# X_train, X_test, y_train, y_test are assumed to be defined at module level
def run():
    n_folds = 5  # value missing in the original snippet; 5 is a common choice
    clfs = [RandomForestClassifier(n_estimators = 200, class_weight={0:1,1:2}),
            ExtraTreesClassifier(n_estimators = 200, criterion = 'gini'),
            GradientBoostingClassifier(n_estimators = 200),
            KNeighborsClassifier(),
            LogisticRegression(penalty='l1', class_weight={0:1,1:2}),
            GaussianNB()]
    clfs_name = ['RandomForestClassifier', 'ExtraTreesClassifier',
                 'GradientBoostingClassifier', 'KNeighborsClassifier',
                 'LogisticRegression', 'GaussianNB']


    # Ready for cross validation
    skf = list(StratifiedKFold(y_train, n_folds))

    # Pre-allocate the data
    blend_train = np.zeros((X_train.shape[0], len(clfs))) # Number of training data x Number of classifiers
    blend_test = np.zeros((X_test.shape[0], len(clfs))) # Number of testing data x Number of classifiers
    for j, clf in enumerate(clfs):
        print 'Training classifier: %s' % clfs_name[j]
        # Number of testing data x Number of folds , 
        # we will take the mean of the predictions later
        blend_test_j = np.zeros((X_test.shape[0], len(skf)))

        for i, (train_index, cv_index) in enumerate(skf):
            print 'Fold [%s]' % (i)

            X_train_f = X_train[train_index] #output from PCA, array returned  
            y_train_f = y_train.iloc[train_index] #iloc if using pandas 
            X_cv = X_train[cv_index]
            y_cv = y_train.iloc[cv_index]

            #Train on train part of fold
            clf.fit(X_train_f, y_train_f)

            #Predict on cv_index part of fold
            blend_train[cv_index, j] = clf.predict(X_cv)
            #N-folds, so i will have n-predictions on x_test
            blend_test_j[:,i] = clf.predict(X_test)

        #Take mean of n-predictions
        blend_test[:,j] = blend_test_j.mean(1)

    #Blending, use different classifier
    b_clfs = [svm.SVC(C=1.0, kernel='linear'),
              RandomForestClassifier(n_estimators = 100),
              KNeighborsClassifier()] 
    b_clfs_name = ['SVC','RandomForestClassifier','KNeighborsClassifier']

    b_pred = np.zeros((blend_test.shape[0], len(b_clfs)))    
    for y, b_clf in enumerate(b_clfs):
        print 'Blending classifier: %s' % b_clfs_name[y]

        b_clf.fit(blend_train, y_train)
        b_pred[:,y] = b_clf.predict(blend_test)

        print 'Accuracy = %s' % (metrics.accuracy_score(y_test, b_pred[:,y]))
        print 'Precision = %s' % (metrics.precision_score(y_test, b_pred[:,y]))
        print 'Recall = %s' % (metrics.recall_score(y_test, b_pred[:,y]))

    #Final Prediction
    print '---- Final score ----'
    #b_pred_m = np.round_(np.mean(b_pred, axis=1))  # plain mean for each row
    b_pred_m = np.round_(np.ma.average(b_pred, axis=1))  # mean for each row; a weights= list can be passed for a weighted average

    print 'Accuracy = %s' % (metrics.accuracy_score(y_test, b_pred_m))
    print 'Precision = %s' % (metrics.precision_score(y_test, b_pred_m))
    print 'Recall = %s' % (metrics.recall_score(y_test, b_pred_m))

Best Answer

There are several approaches that you might consider taking.

Firstly, it might be worth seeing if you can weight the contribution of each model to the final ensemble by some metric based on how well they perform on your validation set.
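As a rough sketch of that first idea (this is not code from the original answer; it simply reuses the blend_train, blend_test, y_train and y_test variables from the question and uses out-of-fold accuracy as a stand-in for validation performance):

import numpy as np
from sklearn import metrics

# Weight each first-layer model by its out-of-fold accuracy on the training data
oof_acc = np.array([metrics.accuracy_score(y_train, blend_train[:, j].astype(int))
                    for j in range(blend_train.shape[1])])

weights = oof_acc / oof_acc.sum()  # normalise so the weights sum to 1
weighted_vote = np.round_(np.average(blend_test, axis=1, weights=weights))

print 'Weighted-vote accuracy = %s' % metrics.accuracy_score(y_test, weighted_vote)

Accuracy is just one option; any metric computed on held-out data (precision, recall, log loss turned into a weight, etc.) can be plugged in instead.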

Secondly, it might be a good idea to look at the output of each model and calculate how strongly correlated they are. If some results are very highly correlated, it might be better to drop some of those models from the ensemble and instead combine the least correlated results (the hypothesis being that the correct guesses should be correlated and the errors uncorrelated).
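A quick, purely illustrative sketch of that check, again reusing blend_train and clfs_name from the question (each column of blend_train holds one model's out-of-fold predictions):

import numpy as np

# Pairwise Pearson correlation between the first-layer models' predictions;
# rows of blend_train.T are the per-model prediction vectors
corr = np.corrcoef(blend_train.T)  # shape (n_models, n_models)

for a in range(len(clfs_name)):
    for b in range(a + 1, len(clfs_name)):
        print '%s vs %s: %.3f' % (clfs_name[a], clfs_name[b], corr[a, b])

Pairs whose correlation comes out close to 1 add little diversity, so dropping one model from each such pair (or keeping only the least correlated subset) is worth trying.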

You may have seen it already, but there is some excellent advice about this in the context of a Kaggle competition here.