Solved – scikit-learn feature selection on k-fold loop

classification, cross-validation, data mining, feature selection, scikit-learn

I am using the StratifiedKFold iterator from sklearn, and I've realized that I need to include a feature-selection step in my experiment. I've read that feature selection (for wrapper methods) must not consider the test data.

However, I couldn't find a good way to implement this with the default StratifiedKFold loop. Should I perform feature selection on each fold (each iteration) and then train my classifier on the reduced features?

Just for clarity, here's my desired experiment:

  1. Select best features from original dataset
  2. Balance the classes with SMOTE (must select features first)
  3. Apply cross-validation

Here is the cross-validation for loop:


    from sklearn.neural_network import MLPClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import balanced_accuracy_score

    # skf (a StratifiedKFold splitter), sc_X (a MinMaxScaler) and
    # makeOverSamplesSMOTE (a SMOTE oversampling helper) are defined earlier

    mlp_acc = []
    adaboost_acc = []

    # The LOOP - where do I apply feature selection?
    for train_index, test_index in skf.split(X, y):

        clf = MLPClassifier(hidden_layer_sizes=(20,), verbose=10,
                            learning_rate_init=0.5, max_iter=2000,
                            activation='logistic', solver='sgd',
                            shuffle=True, random_state=30)

        adaboost_clf = AdaBoostClassifier(
            base_estimator=DecisionTreeClassifier(max_depth=10, random_state=50),
            random_state=40)

        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Normalizing data with MinMaxScaler
        X_train = sc_X.fit_transform(X_train)
        X_test = sc_X.transform(X_test)

        X_train, y_train = makeOverSamplesSMOTE(X_train,y_train)

        adaboost_clf.fit(X_train, y_train)
        adaboost_pred = adaboost_clf.predict(X_test)

        clf.fit(X_train, y_train)
        clf_pred = clf.predict(X_test)

        # Append accuracies to the accuracy array of each classifier
        mlp_acc.append(balanced_accuracy_score(y_test, clf_pred))
        adaboost_acc.append(balanced_accuracy_score(y_test, adaboost_pred))

I'd be glad for some guidance, considering I'm using StratifiedKFold.

Best Answer

Until you're ready to train your full model on all of the data, you should treat your out-of-sample data as nonexistent while you're working with the in-sample data. If you have 1000 observations split into 5 sets of 200 for 5-fold CV, you pretend one of the folds doesn't exist when you work on the remaining 800 observations. If you want to run PCA, for instance, you run PCA on the 800 points and then apply the results of that diagonalization to the out-of-sample 200 (I believe the sklearn functions do this automatically). But you certainly don't diagonalize the covariance matrix for the entire set of 1000 until you're training your "production" model, at which point you've already decided on a model (a neural network with X layers of Y neurons, or a linear regression with L2 regularization for some $\lambda$ that you optimized by cross-validation, for instance).
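With sklearn that pattern looks roughly like the sketch below. (X_in and X_out are just placeholder names here for the 800 in-sample and 200 held-out observations, and the component count is arbitrary.)

    from sklearn.decomposition import PCA

    pca = PCA(n_components=10)                # component count is arbitrary here
    X_in_reduced = pca.fit_transform(X_in)    # diagonalization uses only the 800 in-sample points
    X_out_reduced = pca.transform(X_out)      # the same rotation is merely applied to the held-out 200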

In your case, let's stick with this idea of doing 5-fold CV on 1000 observations. You have your five folds of 200 observations each: $F_1,\dots,F_5$. The first fold you leave out is $F_1$. Do your feature selection and model development on $F_2,\dots,F_5$, pretending $F_1$ doesn't exist. Let's say you select features $X_1$, $X_4$, and $X_7$. Train your model on those features. Now apply your model to the out-of-sample data in $F_1$. Let's say you get an accuracy of 75%.

Now do it again, leaving out $F_2$ and training on the rest. Let's say you select features $X_2$, $X_4$, and $X_7$. Train your model on the 800 observations that exclude $F_2$. Accuracy out-of-sample is 77%. Now do it again with $F_3$ left out, then again with $F_4$ left out, and again with $F_5$ left out. Your accuracy scores are, say, 80%, 65%, and 79%.
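A rough sketch of that repeated selection, with SelectKBest standing in for whatever selection method you actually use, and skf, X, y as in your code; note that the chosen columns are allowed to differ from split to split, just like the $X_1, X_4, X_7$ versus $X_2, X_4, X_7$ example:

    from sklearn.feature_selection import SelectKBest, f_classif

    chosen_per_split = []
    for train_index, test_index in skf.split(X, y):
        selector = SelectKBest(score_func=f_classif, k=3)
        selector.fit(X[train_index], y[train_index])   # selection never sees the held-out partition
        chosen_per_split.append(selector.get_support(indices=True))
    # e.g. [array([0, 3, 6]), array([1, 3, 6]), ...]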

At no point does the feature selection process see the out-of-sample data! That data is always hidden from model development in order to simulate the real application of machine learning, where the model is supposed to work on data the developer may not even know exists. For instance, Siri should be able to figure out a baby's first words, even though Siri has never been trained on other speech by that child, let alone on that exact piece of audio.

The time you do show all five folds to your model is when you want to make a "production" model. That happens when you are confident that you have a general model that will work well, but you still get the same sort of setup, where your in-sample data are $F_1,\dots,F_5$ and your out-of-sample data are whatever happens out in the world (a baby's first words, for a speech-recognition tool, for instance).
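Sketching that final "production" fit, with SelectKBest again as a stand-in and the AdaBoost settings simply mirroring your code; swap in whatever you actually settled on during cross-validation:

    from sklearn.preprocessing import MinMaxScaler
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Only now, with the recipe validated by CV, do all five folds get used at once
    X_all = MinMaxScaler().fit_transform(X)
    final_selector = SelectKBest(score_func=f_classif, k=3).fit(X_all, y)
    X_all = final_selector.transform(X_all)
    X_all, y_all = makeOverSamplesSMOTE(X_all, y)
    production_clf = AdaBoostClassifier(
        base_estimator=DecisionTreeClassifier(max_depth=10, random_state=50),
        random_state=40).fit(X_all, y_all)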

With all of that stated, the answer to your question:

"Should i perform a feature selection on each fold (for iteration) and then train my classifier with the reduced features?"

Not for each fold, but for each set of in-sample data that leaves out a fold for out-of-sample testing.
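In terms of your loop, that means fitting the selector right after the scaler and before SMOTE, on the training split only, and then only transforming the test split with it. A sketch, again with SelectKBest standing in for your wrapper method and an arbitrary k:

    from sklearn.feature_selection import SelectKBest, f_classif

    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        X_train = sc_X.fit_transform(X_train)
        X_test = sc_X.transform(X_test)

        # Feature selection: fit on the training split only...
        selector = SelectKBest(score_func=f_classif, k=10)
        X_train = selector.fit_transform(X_train, y_train)
        # ...and only *apply* the fitted selector to the held-out split
        X_test = selector.transform(X_test)

        # SMOTE and both classifiers then see only the reduced training features
        X_train, y_train = makeOverSamplesSMOTE(X_train, y_train)
        # ...fit adaboost_clf and clf, predict on X_test, and append the
        # balanced accuracies exactly as in your original loop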

Edit: I think I'm using "fold" incorrectly. What I'm calling $F_1$ etc. are not the folds; they're partitions of the data. A fold would then be $F_1,\dots,F_4$ as training data and $F_5$ as validation data. So with that terminology, the answer to your question is yes.
