Solved – scikit-learn feature selection on k-fold loop

classification, cross-validation, data mining, feature selection, scikit-learn

I am using the StratifiedKFold iterator from sklearn, and I've realized that I need to include a feature-selection step in my experiment. I've read that feature selection (for wrapper methods) must not consider the test data.

However, I couldn't find a good way to implement this with the default StratifiedKFold loop. Should I perform feature selection on each fold (each iteration) and then train my classifier on the reduced features?

Just for clarity, here's my desired experiment:

  1. Select best features from original dataset
  2. Balance the classes with SMOTE (must select features first)
  3. Apply cross-validation

Here is the cross-validation for loop:


    from sklearn.neural_network import MLPClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import balanced_accuracy_score

    # skf (a StratifiedKFold splitter), sc_X (a MinMaxScaler) and
    # makeOverSamplesSMOTE (a SMOTE oversampling helper) are defined earlier

    mlp_acc = []
    adaboost_acc = []

    # The LOOP - where do I apply feature selection?
    for train_index, test_index in skf.split(X, y):

        clf = MLPClassifier(hidden_layer_sizes=(20,), verbose=10,
                            learning_rate_init=0.5, max_iter=2000,
                            activation='logistic', solver='sgd',
                            shuffle=True, random_state=30)

        adaboost_clf = AdaBoostClassifier(
            base_estimator=DecisionTreeClassifier(max_depth=10, random_state=50),
            random_state=40)

        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Normalizing data with MinMaxScaler
        X_train = sc_X.fit_transform(X_train)
        X_test = sc_X.transform(X_test)

        X_train, y_train = makeOverSamplesSMOTE(X_train,y_train)

        adaboost_clf.fit(X_train, y_train)
        adaboost_pred = adaboost_clf.predict(X_test)

        clf.fit(X_train, y_train)
        clf_pred = clf.predict(X_test)

        # Append accuracies to the accuracy array of each classifier
        mlp_acc.append(balanced_accuracy_score(y_test, clf_pred))
        adaboost_acc.append(balanced_accuracy_score(y_test, adaboost_pred))

I'd be glad for some guidance, considering I'm using StratifiedKFold.

Best Answer

Until you're ready to train your full model on all of the data, you should treat your out-of-sample data as nonexistent while you're working with the in-sample data. If you have 1000 observations split into 5 sets of 200 for 5-fold CV, you pretend one of the folds doesn't exist when you work on the remaining 800 observations. If you want to run PCA, for instance, you run PCA on the 800 points and then apply the results of that diagonalization to the out-of-sample 200 (I believe the sklearn functions do this automatically). But you certainly don't diagonalize the covariance matrix for the entire set of 1000 until you're training your "production" model, at which point you've already decided on a model (a neural network with X layers of Y neurons, or a linear regression with L2 regularization for some $\lambda$ that you optimized by cross-validation, for instance).
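With sklearn that pattern looks roughly like the sketch below. (X_in and X_out are just placeholder names here for the 800 in-sample and 200 held-out observations, and the component count is arbitrary.)

    from sklearn.decomposition import PCA

    pca = PCA(n_components=10)                # component count is arbitrary here
    X_in_reduced = pca.fit_transform(X_in)    # diagonalization uses only the 800 in-sample points
    X_out_reduced = pca.transform(X_out)      # the same rotation is merely applied to the held-out 200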

In your case, let's stick with this idea of doing 5-fold CV on 1000 observations. You have your five folds of 200 observations each: $F_1,\dots,F_5$. The first fold you leave out is $F_1$. Do your feature selection and model development on $F_2,\dots,F_5$, pretending $F_1$ doesn't exist. Let's say you select features $X_1$, $X_4$, and $X_7$. Train your model on those features. Now apply your model to the out-of-sample data in $F_1$. Let's say you get an accuracy of 75%.

Now do it again, leaving out $F_2$ and training on the rest. Let's say you select features $X_2$, $X_4$, and $X_7$. Train your model on the 800 observations that exclude $F_2$. Accuracy out-of-sample is 77%. Now do it again with $F_3$ left out, then again with $F_4$ left out, and again with $F_5$ left out. Your accuracy scores are, say, 80%, 65%, and 79%.
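A rough sketch of that repeated selection, with SelectKBest standing in for whatever selection method you actually use, and skf, X, y as in your code; note that the chosen columns are allowed to differ from split to split, just like the $X_1, X_4, X_7$ versus $X_2, X_4, X_7$ example:

    from sklearn.feature_selection import SelectKBest, f_classif

    chosen_per_split = []
    for train_index, test_index in skf.split(X, y):
        selector = SelectKBest(score_func=f_classif, k=3)
        selector.fit(X[train_index], y[train_index])   # selection never sees the held-out partition
        chosen_per_split.append(selector.get_support(indices=True))
    # e.g. [array([0, 3, 6]), array([1, 3, 6]), ...]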

At no point does the feature selection process see the out-of-sample data! That data is always hidden from model development in order to simulate the real application of machine learning, where the model is supposed to work on data the developer may not even know exists. For instance, Siri should be able to figure out a baby's first words, even though Siri has never been trained on other speech by that child, let alone on that exact piece of audio.

The time you do show all five folds to your model is when you want to make a "production" model. That happens when you are confident that you have a general model that will work well, but you still get the same sort of setup, where your in-sample data are $F_1,\dots,F_5$ and your out-of-sample data are whatever happens out in the world (a baby's first words, for a speech-recognition tool, for instance).
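Sketching that final "production" fit, with SelectKBest again as a stand-in and the AdaBoost settings simply mirroring your code; swap in whatever you actually settled on during cross-validation:

    from sklearn.preprocessing import MinMaxScaler
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Only now, with the recipe validated by CV, do all five folds get used at once
    X_all = MinMaxScaler().fit_transform(X)
    final_selector = SelectKBest(score_func=f_classif, k=3).fit(X_all, y)
    X_all = final_selector.transform(X_all)
    X_all, y_all = makeOverSamplesSMOTE(X_all, y)
    production_clf = AdaBoostClassifier(
        base_estimator=DecisionTreeClassifier(max_depth=10, random_state=50),
        random_state=40).fit(X_all, y_all)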

With all of that stated, the answer to your question:

"Should i perform a feature selection on each fold (for iteration) and then train my classifier with the reduced features?"

Not for each fold, but for each set of in-sample data that leaves out a fold for out-of-sample testing.
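In terms of your loop, that means fitting the selector right after the scaler and before SMOTE, on the training split only, and then only transforming the test split with it. A sketch, again with SelectKBest standing in for your wrapper method and an arbitrary k:

    from sklearn.feature_selection import SelectKBest, f_classif

    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        X_train = sc_X.fit_transform(X_train)
        X_test = sc_X.transform(X_test)

        # Feature selection: fit on the training split only...
        selector = SelectKBest(score_func=f_classif, k=10)
        X_train = selector.fit_transform(X_train, y_train)
        # ...and only *apply* the fitted selector to the held-out split
        X_test = selector.transform(X_test)

        # SMOTE and both classifiers then see only the reduced training features
        X_train, y_train = makeOverSamplesSMOTE(X_train, y_train)
        # ...fit adaboost_clf and clf, predict on X_test, and append the
        # balanced accuracies exactly as in your original loop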

Edit: I think I'm using "fold" incorrectly. What I'm calling $F_1$ etc. are not the folds; they're partitions of the data. A fold would then be $F_1,\dots,F_4$ as training data and $F_5$ as validation data. So with that terminology, the answer to your question is yes.
