Solved – Being able to detect the important features sklearn.make_classification generates

feature selection, machine learning, python, scikit learn

I am trying to learn about feature selection, and I thought using make_classification in sklearn would be helpful. I'm confused, though, because I can't find as many informative features as I expected.

I am using SelectKBest to determine the number of features. The features it selects (with either chi2 or f_classif) agree well with the features that turn out to be useful when I train a RandomForestClassifier or any other classifier.
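One difference between the two scoring functions is worth checking before comparing them: chi2 rejects negative inputs, and its scores depend on whatever offset is used to make the data non-negative, while f_classif does not care about the offset. The sketch below is only an illustration; the dataset parameters, the random_state and the offset of 10.0 are arbitrary choices of mine, not values from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, n_repeated=0, shuffle=False,
                           random_state=0)

# chi2 rejects negative inputs, so shift the data until the minimum is zero.
base = X - X.min()
# Same data with a larger, equally arbitrary offset.
shifted = base + 10.0

f_base, _ = f_classif(base, y)
f_shifted, _ = f_classif(shifted, y)
chi_base, _ = chi2(base, y)
chi_shifted, _ = chi2(shifted, y)

print("f_classif unchanged by offset:", np.allclose(f_base, f_shifted, rtol=1e-6))
print("chi2 shrinks with offset:     ", bool(np.all(chi_shifted < chi_base)))
```

Adding a constant leaves the gap between observed and expected class sums unchanged but inflates the expected sums in the denominator, so the chi2 statistic shrinks. For continuous features like these, the chi2 ranking is therefore best read with some caution.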

By adding repeated features and seeing which columns repeat, I have been able to determine that make_classification generates the first n features (n = intended number of informative features) as the informative ones when shuffle=False.
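This matches the column order described in the make_classification documentation for shuffle=False: informative columns first, then redundant ones, then repeated ones, then noise. A minimal sketch of the repeated-feature check, with sample size and random_state chosen arbitrarily for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification

n_informative, n_repeated = 3, 2
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=n_informative, n_redundant=0,
                           n_repeated=n_repeated, shuffle=False,
                           random_state=0)

# With shuffle=False the repeated columns sit right after the informative ones.
sources = {}
for j in range(n_informative, n_informative + n_repeated):
    # Look for the earlier column that this one duplicates.
    sources[j] = [i for i in range(n_informative)
                  if np.allclose(X[:, i], X[:, j])]
    print("column", j, "duplicates column(s)", sources[j])
```

Every repeated column should match one of the first n_informative columns, which confirms where the intended informative features live.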

However, in many cases the number of features that actually help is smaller than the intended number of informative features. (I have noticed that the number of clusters has an impact.) For instance, n_informative might be 3, but I can only see that one feature is useful, either via SelectKBest or by actually training a classifier.

So my two questions are:

1.) How can I detect the importance of the features make_classification is intending to be important?

2.) What distinguishes the important features chi2/f_classif are able to detect from the important features they are unable to detect?

The code I am using (output is below):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
import pandas as pd
import numpy as np

np.random.seed(10)
def illustrate(n_informative, n_clusters_per_class):
    data_set = make_classification(n_samples = 500,
                                   n_features = 10,
                                   n_informative = n_informative,
                                   n_redundant=0,
                                   n_repeated=0,
                                   n_classes=2,
                                   n_clusters_per_class = n_clusters_per_class,
                                   weights=None,
                                   flip_y=0.0,
                                   class_sep=1.0,
                                   hypercube=True,
                                   shift=0.0,
                                   scale=1.0,
                                   shuffle = False,
                                   random_state = 6)

    # Split into a feature frame and a label series.
    X, Y = pd.DataFrame(data_set[0]), pd.Series(data_set[1], name='class')
    # chi2 requires non-negative inputs, so shift everything by the global minimum.
    X = X + abs(X.min().min())
    sel1 = SelectKBest(k=1)  # the default score function is f_classif
    sel1.fit(X, Y)
    sel2 = SelectKBest(chi2, k=1)
    sel2.fit(X, Y)
    res = pd.concat([pd.Series(sel1.scores_, name='f_classif_score'),
                     pd.Series(sel1.pvalues_, name='f_classif_p_value'),
                     pd.Series(sel2.scores_, name='chi2_score'),
                     pd.Series(sel2.pvalues_, name='chi2_pvalue')],
                    axis=1).sort_values('f_classif_score', ascending=False)
    print(res)

for n_informative in [1, 2, 3, 4]:
    for n_clusters_per_class in range(1, n_informative):
        print('Informative Features: {} Clusters Per Class : {}'.format(
            n_informative, n_clusters_per_class))
        illustrate(n_informative, n_clusters_per_class)

Output of Above Code:

Informative Features: 2 Clusters Per Class : 1
   f_classif_score  f_classif_p_value  chi2_score   chi2_pvalue
0      1016.973810      2.130399e-122  134.325167  4.638173e-31
1       772.724765      2.300631e-103  146.799731  8.679832e-34
5         4.078865       4.395792e-02    1.105015  2.931682e-01
8         1.979141       1.601046e-01    0.554276  4.565756e-01
7         1.374163       2.416583e-01    0.372371  5.417147e-01
3         0.443690       5.056552e-01    0.113065  7.366816e-01
4         0.197154       6.572205e-01    0.060201  8.061782e-01
9         0.186371       6.661408e-01    0.056129  8.127227e-01
6         0.169497       6.807367e-01    0.050526  8.221512e-01
2         0.054381       8.157042e-01    0.016877  8.966354e-01
Informative Features: 3 Clusters Per Class : 1
   f_classif_score  f_classif_p_value  chi2_score   chi2_pvalue
0       687.446137       7.661852e-96  162.798076  2.769074e-37
2       568.414329       2.215744e-84  175.119185  5.638711e-40
9         4.233500       4.015367e-02    1.353756  2.446226e-01
4         2.181651       1.402967e-01    0.649694  4.202221e-01
6         0.416503       5.189845e-01    0.127764  7.207621e-01
5         0.250830       6.167129e-01    0.067124  7.955711e-01
7         0.225946       6.347547e-01    0.068300  7.938284e-01
3         0.210548       6.465381e-01    0.065311  7.982908e-01
8         0.149100       6.995618e-01    0.046806  8.287169e-01
1         0.011565       9.144025e-01    0.003235  9.546456e-01
Informative Features: 3 Clusters Per Class : 2
   f_classif_score  f_classif_p_value  chi2_score   chi2_pvalue
2       812.090540      1.144207e-106  150.031081  1.706735e-34
0       106.629707       8.813981e-23   31.707663  1.792137e-08
7         3.907313       4.862763e-02    1.165847  2.802561e-01
5         1.941582       1.641185e-01    0.634154  4.258357e-01
9         1.456108       2.281233e-01    0.449901  5.023821e-01
6         1.010343       3.153089e-01    0.317138  5.733325e-01
3         0.918498       3.383347e-01    0.278306  5.978138e-01
4         0.892927       3.451437e-01    0.285967  5.928169e-01
1         0.206608       6.496370e-01    0.098889  7.531666e-01
8         0.106946       7.437854e-01    0.029129  8.644814e-01
Informative Features: 4 Clusters Per Class : 1
   f_classif_score  f_classif_p_value  chi2_score   chi2_pvalue
2       823.390874      1.344646e-107  126.561785  2.316755e-29
5         4.964055       2.632530e-02    1.234543  2.665253e-01
4         2.088944       1.489976e-01    0.511490  4.744944e-01
3         2.048932       1.529403e-01    0.812675  3.673306e-01
9         1.234054       2.671562e-01    0.254791  6.137213e-01
1         0.315991       5.742796e-01    0.041092  8.393598e-01
6         0.043817       8.342805e-01    0.010935  9.167180e-01
8         0.033963       8.538599e-01    0.007824  9.295150e-01
7         0.012199       9.120972e-01    0.002627  9.591195e-01
0         0.002108       9.634011e-01    0.000199  9.887401e-01
Informative Features: 4 Clusters Per Class : 2
   f_classif_score  f_classif_p_value  chi2_score   chi2_pvalue
2        59.446089       6.882444e-14   20.306324  6.598215e-06
3        45.413607       4.422173e-11   25.331602  4.827347e-07
6         4.355442       3.739881e-02    0.965005  3.259291e-01
7         2.444909       1.185419e-01    0.490491  4.837084e-01
9         1.508166       2.199992e-01    0.366551  5.448901e-01
5         1.438351       2.309767e-01    0.303560  5.816592e-01
1         0.956231       3.286131e-01    0.176588  6.743222e-01
8         0.886270       3.469467e-01    0.215632  6.423882e-01
4         0.175559       6.753984e-01    0.042743  8.362091e-01
0         0.064596       7.994786e-01    0.025981  8.719465e-01
Informative Features: 4 Clusters Per Class : 3
   f_classif_score  f_classif_p_value  chi2_score  chi2_pvalue
0        37.608756       1.762369e-09   15.340979     0.000090
3        35.104866       5.834908e-09   17.716788     0.000026
5         7.474495       6.480748e-03    1.632879     0.201305
8         6.424434       1.156120e-02    1.636956     0.200744
6         0.566897       4.518503e-01    0.130881     0.717521
4         0.225665       6.349655e-01    0.057623     0.810293
7         0.149020       6.996387e-01    0.031846     0.858367
2         0.033591       8.546550e-01    0.015237     0.901759
1         0.028674       8.656032e-01    0.011647     0.914058
9         0.004558       9.461984e-01    0.001164     0.972785

Best Answer

I have to admit, I initially thought that chi2 and f_classif might be the culprits. I therefore quickly wrote the two functions below.

One looking at feature importances calculated by the random forest classifier:

def get_rf_feat_importances(X,Y):

    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier()
    rf.fit(X, Y)

    return rf.feature_importances_

And the other plotting the Regularisation Path:

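def get_LARS_Lasso_path(X,Y):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import linear_model
    # Compute the Lasso path with LARS; coefs has one column per step of the path.
    alphas, _, coefs = linear_model.lars_path(X.values, Y.values, method='lasso', verbose=True)

    # x-axis: L1 norm of the coefficients at each step, normalised by the final norm.
    xx = np.sum(np.abs(coefs.T), axis=1)
    xx /= xx[-1]

    # One line per feature; dashed verticals mark the steps of the path.
    plt.plot(xx, coefs.T)
    ymin, ymax = plt.ylim()
    plt.vlines(xx, ymin, ymax, linestyle='dashed')
    plt.xlabel('|coef| / max|coef|')
    plt.ylabel('Coefficients')
    plt.title('LASSO Path')
    plt.axis('tight')
    plt.savefig('Lasso_Path.png')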

To my surprise, these show similar results. The feature importances from the first function and the regularisation path from the second sometimes indicate the intended number of informative features (especially for 2), but in most cases the number of informative features they indicate is smaller than the number passed to make_classification.

Answers:

First, to question 2): judging from my two functions above, the phenomenon does not seem to be specific to the chi2 or f_classif scores. What these two scores do is already explained well here, so I will not repeat it.
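As a quick sanity check on what f_classif can and cannot see: with two classes, the one-way ANOVA F statistic equals the square of the pooled-variance two-sample t statistic, so the score is driven entirely by the gap between the class means of a single column. The sketch below verifies that identity; the dataset parameters and random_state are illustrative assumptions rather than the values used in the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, shuffle=False,
                           random_state=0)

a, b = X[y == 0], X[y == 1]
n_a, n_b = len(a), len(b)

# Pooled-variance two-sample t statistic, computed per column.
pooled_var = ((n_a - 1) * a.var(axis=0, ddof=1)
              + (n_b - 1) * b.var(axis=0, ddof=1)) / (n_a + n_b - 2)
mean_gap = a.mean(axis=0) - b.mean(axis=0)
t_stat = mean_gap / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

f_scores, _ = f_classif(X, y)

for j in range(X.shape[1]):
    print(f"feature {j}: mean gap = {mean_gap[j]:+.3f}, "
          f"t^2 = {t_stat[j] ** 2:.3f}, F = {f_scores[j]:.3f}")
```

A column whose class means nearly coincide gets a low score, even if it differs between classes in spread or only helps in combination with other columns.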

1) The only thing I can think of here is that all of these methods look at the importance of each feature individually. It is possible that the informative features are correlated among themselves, so that once the contribution of one of them to predictive performance is accounted for, the others appear redundant. This is explained in this comprehensive (albeit slightly dated) review.

"In Section 4.2, we introduced nested subset methods that provide a useful ranking of subsets, not of individual variables: some variables may have a low rank because they are redundant and yet be highly relevant."
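One simple way to act on this is to score subsets of columns rather than single columns. The sketch below compares cross-validated accuracy for each informative column alone, for all informative columns together, and for the noise columns. It is a stand-in of my own, not the nested subset method from the review: the cv_accuracy helper is hypothetical, and the classifier settings and random_state are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

n_informative = 3
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=n_informative, n_redundant=0,
                           n_repeated=0, n_classes=2, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

def cv_accuracy(columns):
    """Mean 5-fold accuracy of a random forest restricted to `columns`."""
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[:, columns], y, cv=5).mean()

# shuffle=False keeps the informative columns first and the noise columns last.
results = {f"column {j} alone": cv_accuracy([j]) for j in range(n_informative)}
results["all informative columns"] = cv_accuracy(list(range(n_informative)))
results["noise columns only"] = cv_accuracy(list(range(n_informative, X.shape[1])))

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

If a column scores close to chance on its own but the full informative subset clearly beats the best single column, that column is relevant only jointly, which is exactly the case a per-feature ranking misses.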