I am trying to learn about feature selection, and I thought using make_classification in sklearn would be helpful. I'm confused, though, because I can't find as many informative features as expected.
I am using SelectKBest to score the features. The ones it selects (via either chi2 or f_classif) agree well with the features that prove useful when training a RandomForestClassifier or any other classifier.
By adding repeated features and seeing which ones repeat, I have been able to determine that make_classification generates the first n features (n = the intended number of informative features) as the informative ones.
However, in many cases the number of features that actually help is smaller than the intended number of informative features. (I have noticed that the number of clusters has an impact.) For instance, n_informative might be 3, but I can only see that one feature is useful, whether via SelectKBest or by actually training a classifier.
So my two questions are:
1.) How can I detect the importance of the features that make_classification intends to be important?
2.) What distinguishes the important features that chi2/f_classif are able to detect from the important features they are unable to detect?
The code I am using (output is below):
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
import pandas as pd
import numpy as np

np.random.seed(10)

def illustrate(n_informative, n_clusters_per_class):
    data_set = make_classification(n_samples=500,
                                   n_features=10,
                                   n_informative=n_informative,
                                   n_redundant=0,
                                   n_repeated=0,
                                   n_classes=2,
                                   n_clusters_per_class=n_clusters_per_class,
                                   weights=None,
                                   flip_y=0.0,
                                   class_sep=1.0,
                                   hypercube=True,
                                   shift=0.0,
                                   scale=1.0,
                                   shuffle=False,
                                   random_state=6)
    X, Y = pd.DataFrame(data_set[0]), pd.Series(data_set[1], name='class')
    # chi2 requires non-negative inputs, so shift all values by the global minimum
    X = X + abs(X.min().min())
    sel1 = SelectKBest(k=1)  # the default score function is f_classif
    sel1.fit(X, Y)
    sel2 = SelectKBest(chi2, k=1)
    sel2.fit(X, Y)
    res = pd.concat([pd.Series(sel1.scores_, name='f_classif_score'),
                     pd.Series(sel1.pvalues_, name='f_classif_p_value'),
                     pd.Series(sel2.scores_, name='chi2_score'),
                     pd.Series(sel2.pvalues_, name='chi2_pvalue')],
                    axis=1).sort_values('f_classif_score', ascending=False)
    print(res)

for n_informative in [1, 2, 3, 4]:
    # range(1, n_informative) is empty for n_informative = 1, so the output starts at 2
    for n_clusters_per_class in range(1, n_informative):
        print('Informative Features: {} Clusters Per Class : {}'.format(
            n_informative, n_clusters_per_class))
        illustrate(n_informative, n_clusters_per_class)
Output of Above Code:
Informative Features: 2 Clusters Per Class : 1
f_classif_score f_classif_p_value chi2_score chi2_pvalue
0 1016.973810 2.130399e-122 134.325167 4.638173e-31
1 772.724765 2.300631e-103 146.799731 8.679832e-34
5 4.078865 4.395792e-02 1.105015 2.931682e-01
8 1.979141 1.601046e-01 0.554276 4.565756e-01
7 1.374163 2.416583e-01 0.372371 5.417147e-01
3 0.443690 5.056552e-01 0.113065 7.366816e-01
4 0.197154 6.572205e-01 0.060201 8.061782e-01
9 0.186371 6.661408e-01 0.056129 8.127227e-01
6 0.169497 6.807367e-01 0.050526 8.221512e-01
2 0.054381 8.157042e-01 0.016877 8.966354e-01
Informative Features: 3 Clusters Per Class : 1
f_classif_score f_classif_p_value chi2_score chi2_pvalue
0 687.446137 7.661852e-96 162.798076 2.769074e-37
2 568.414329 2.215744e-84 175.119185 5.638711e-40
9 4.233500 4.015367e-02 1.353756 2.446226e-01
4 2.181651 1.402967e-01 0.649694 4.202221e-01
6 0.416503 5.189845e-01 0.127764 7.207621e-01
5 0.250830 6.167129e-01 0.067124 7.955711e-01
7 0.225946 6.347547e-01 0.068300 7.938284e-01
3 0.210548 6.465381e-01 0.065311 7.982908e-01
8 0.149100 6.995618e-01 0.046806 8.287169e-01
1 0.011565 9.144025e-01 0.003235 9.546456e-01
Informative Features: 3 Clusters Per Class : 2
f_classif_score f_classif_p_value chi2_score chi2_pvalue
2 812.090540 1.144207e-106 150.031081 1.706735e-34
0 106.629707 8.813981e-23 31.707663 1.792137e-08
7 3.907313 4.862763e-02 1.165847 2.802561e-01
5 1.941582 1.641185e-01 0.634154 4.258357e-01
9 1.456108 2.281233e-01 0.449901 5.023821e-01
6 1.010343 3.153089e-01 0.317138 5.733325e-01
3 0.918498 3.383347e-01 0.278306 5.978138e-01
4 0.892927 3.451437e-01 0.285967 5.928169e-01
1 0.206608 6.496370e-01 0.098889 7.531666e-01
8 0.106946 7.437854e-01 0.029129 8.644814e-01
Informative Features: 4 Clusters Per Class : 1
f_classif_score f_classif_p_value chi2_score chi2_pvalue
2 823.390874 1.344646e-107 126.561785 2.316755e-29
5 4.964055 2.632530e-02 1.234543 2.665253e-01
4 2.088944 1.489976e-01 0.511490 4.744944e-01
3 2.048932 1.529403e-01 0.812675 3.673306e-01
9 1.234054 2.671562e-01 0.254791 6.137213e-01
1 0.315991 5.742796e-01 0.041092 8.393598e-01
6 0.043817 8.342805e-01 0.010935 9.167180e-01
8 0.033963 8.538599e-01 0.007824 9.295150e-01
7 0.012199 9.120972e-01 0.002627 9.591195e-01
0 0.002108 9.634011e-01 0.000199 9.887401e-01
Informative Features: 4 Clusters Per Class : 2
f_classif_score f_classif_p_value chi2_score chi2_pvalue
2 59.446089 6.882444e-14 20.306324 6.598215e-06
3 45.413607 4.422173e-11 25.331602 4.827347e-07
6 4.355442 3.739881e-02 0.965005 3.259291e-01
7 2.444909 1.185419e-01 0.490491 4.837084e-01
9 1.508166 2.199992e-01 0.366551 5.448901e-01
5 1.438351 2.309767e-01 0.303560 5.816592e-01
1 0.956231 3.286131e-01 0.176588 6.743222e-01
8 0.886270 3.469467e-01 0.215632 6.423882e-01
4 0.175559 6.753984e-01 0.042743 8.362091e-01
0 0.064596 7.994786e-01 0.025981 8.719465e-01
Informative Features: 4 Clusters Per Class : 3
f_classif_score f_classif_p_value chi2_score chi2_pvalue
0 37.608756 1.762369e-09 15.340979 0.000090
3 35.104866 5.834908e-09 17.716788 0.000026
5 7.474495 6.480748e-03 1.632879 0.201305
8 6.424434 1.156120e-02 1.636956 0.200744
6 0.566897 4.518503e-01 0.130881 0.717521
4 0.225665 6.349655e-01 0.057623 0.810293
7 0.149020 6.996387e-01 0.031846 0.858367
2 0.033591 8.546550e-01 0.015237 0.901759
1 0.028674 8.656032e-01 0.011647 0.914058
9 0.004558 9.461984e-01 0.001164 0.972785
Best Answer
I have to admit, I initially thought that chi2 and f_classif might be the culprits. I therefore quickly wrote the functions below.
One looking at the feature importances calculated by the random forest classifier:
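A minimal sketch of such a check (not the original function; the name rf_importances and the forest settings are assumptions for illustration), reusing the make_classification settings from the question:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import pandas as pd


def rf_importances(n_informative, n_clusters_per_class):
    """Hypothetical helper: rank the 10 features by random forest importance."""
    X, y = make_classification(n_samples=500, n_features=10,
                               n_informative=n_informative,
                               n_redundant=0, n_repeated=0, n_classes=2,
                               n_clusters_per_class=n_clusters_per_class,
                               flip_y=0.0, shuffle=False, random_state=6)
    # n_estimators and random_state are arbitrary choices for this sketch
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X, y)
    # With shuffle=False the informative features are columns 0..n_informative-1
    return pd.Series(forest.feature_importances_,
                     name='rf_importance').sort_values(ascending=False)


print(rf_importances(n_informative=3, n_clusters_per_class=1))
```

Comparing the top of this ranking with the first n_informative column indices shows how many of the intended features the forest actually relies on.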
And the other plotting the Regularisation Path:
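A rough sketch of a regularisation path on the same data (again not the original function). As a stand-in it uses sklearn's lasso_path on the centred 0/1 labels, which is an assumption made here for simplicity. The helper name entry_order is hypothetical, and instead of plotting it reports the order in which features first receive a non-zero coefficient:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler


def entry_order(n_informative, n_clusters_per_class):
    """Hypothetical helper: order in which features enter an L1 path."""
    X, y = make_classification(n_samples=500, n_features=10,
                               n_informative=n_informative,
                               n_redundant=0, n_repeated=0, n_classes=2,
                               n_clusters_per_class=n_clusters_per_class,
                               flip_y=0.0, shuffle=False, random_state=6)
    # lasso_path fits no intercept, so standardise X and centre the labels
    X = StandardScaler().fit_transform(X)
    y_centered = y - y.mean()
    # alphas run from strongest to weakest penalty; coefs is (n_features, n_alphas)
    alphas, coefs, _ = lasso_path(X, y_centered)
    active = coefs != 0
    # Index of the first alpha at which each feature is non-zero
    # (len(alphas) if the feature never enters the path)
    first_active = np.where(active.any(axis=1),
                            active.argmax(axis=1), len(alphas))
    order = np.argsort(first_active, kind='stable')
    return order, alphas, coefs


order, alphas, coefs = entry_order(n_informative=3, n_clusters_per_class=1)
print('Features in order of entry:', order)
```

Plotting each row of coefs against -log10(alphas) gives the usual path picture: features that enter early and grow large are the ones the linear model finds useful.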
To my surprise, these show similar results. The feature importances generated by the first one and the regularisation path generated by the second sometimes indicate the same number of informative features (especially for 2), but in most cases they indicate fewer informative features than were requested from the make_classification function.
Answers:
First, to question 2): from my two functions above, it seems that the phenomenon is not specific to the chi2 or f_classif scores. What these two scores do is already explained well here, so I am not going to repeat it.
1) The only thing I can think of here is that all of these methods look at the individual importance of each feature. It is possible that the informative features are correlated among themselves, so that once one of them has been accounted for, the others add little predictive value and appear redundant. This is explained in this comprehensive (albeit slightly dated) review.
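One way to probe this explanation is to look directly at the correlations among the columns that make_classification marks as informative. The sketch below is only illustrative: the helper name informative_correlations is hypothetical, and it relies on shuffle=False placing the informative features first.

```python
from sklearn.datasets import make_classification
import pandas as pd


def informative_correlations(n_informative, n_clusters_per_class):
    """Hypothetical helper: correlations among the informative columns."""
    X, y = make_classification(n_samples=500, n_features=10,
                               n_informative=n_informative,
                               n_redundant=0, n_repeated=0, n_classes=2,
                               n_clusters_per_class=n_clusters_per_class,
                               flip_y=0.0, shuffle=False, random_state=6)
    # With shuffle=False the first n_informative columns are the informative ones
    informative = pd.DataFrame(X[:, :n_informative])
    result = {'overall': informative.corr()}
    # Correlations within each class, since the classes are generated separately
    for label in (0, 1):
        result['class_{}'.format(label)] = informative[y == label].corr()
    return result


for name, corr in informative_correlations(4, 1).items():
    print(name)
    print(corr.round(2))
```

Large off-diagonal entries would be consistent with the redundancy argument above; small ones would suggest looking for another cause.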