I have to admit, I initially thought chi2 and f_classif might be the culprits, so I quickly wrote the two functions below. The first looks at the feature importances calculated by a random forest classifier:
def get_rf_feat_importances(X, Y):
    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier()
    rf.fit(X, Y)
    return rf.feature_importances_
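For context, here is a usage sketch of the kind of check this function enables; the dataset parameters below are illustrative, not the asker's actual setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data: 10 features, of which only 3 are informative (the rest are noise).
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, n_redundant=0,
                           random_state=0)

rf = RandomForestClassifier(random_state=0)
rf.fit(X, y)
importances = rf.feature_importances_

# Importances are normalised to sum to 1; rank them to see which dominate.
ranked = np.argsort(importances)[::-1]
print("top features:", ranked[:3], importances[ranked[:3]])
```

Comparing how many features carry most of the importance mass against `n_informative` is the check being made throughout this answer.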
And the other plots the regularisation path:
def get_LARS_Lasso_path(X, Y):
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import linear_model

    alphas, _, coefs = linear_model.lars_path(X.values, Y.values, method='lasso', verbose=True)

    # Normalise the x-axis to the fraction of the full coefficient budget.
    xx = np.sum(np.abs(coefs.T), axis=1)
    xx /= xx[-1]

    plt.plot(xx, coefs.T)
    ymin, ymax = plt.ylim()
    plt.vlines(xx, ymin, ymax, linestyle='dashed')
    plt.xlabel('|coef| / max|coef|')
    plt.ylabel('Coefficients')
    plt.title('LASSO Path')
    plt.axis('tight')
    plt.savefig('Lasso_Path.png')
To my surprise, these show similar results. The feature importances generated by the first and the regularisation path generated by the second sometimes indicate the same number of informative features (especially for 2), but in most cases the number of informative features they indicate is less than what was passed to the make_classification function.
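The same check can be made without plotting: count how many variables the LARS-lasso path actually activates. A minimal sketch, again with illustrative synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import lars_path

# Illustrative data: 10 features, only 3 informative.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, n_redundant=0,
                           random_state=0)

# lars_path expects a regression target, so cast the labels to float.
alphas, active, coefs = lars_path(X, y.astype(float), method='lasso')

# `active` holds the indices of the variables still in the model at the
# end of the path; comparing its length with n_informative makes the
# same point as the plot.
print(len(active), "of", X.shape[1], "features are active at the end of the path")
```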
Answers:
First, to question 2): from my two functions above, it seems the phenomenon is not specific to the chi2 or f_classif scores. What these two scores do is already explained well here, so I am not going to repeat it.
1) The only thing I can think of here is that all of these methods look at the importance of each feature individually. It is possible that the informative features are correlated with each other, so that once one feature's contribution to predictive performance is accounted for, the others become redundant. This is explained in this comprehensive (albeit slightly dated) review:
In Section 4.2, we introduced nested subset methods that provide a
useful ranking of subsets, not of individual variables: some variables
may have a low rank because they are redundant and yet be highly
relevant.
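The redundancy effect described in that quote is easy to reproduce. In this deliberately constructed sketch (the variable names are illustrative), two columns carry nearly the same signal, so neither gets full credit on its own:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
signal = rng.normal(size=2000)
noise = rng.normal(size=2000)

X = np.column_stack([
    signal,                                 # informative
    signal + 0.01 * rng.normal(size=2000),  # near-duplicate of column 0
    noise,                                  # pure noise
])
y = (signal > 0).astype(int)

rf = RandomForestClassifier(random_state=0).fit(X, y)
imp = rf.feature_importances_
print(imp)
# The importance tends to be shared between columns 0 and 1, even though
# a single underlying variable drives y.
```

This is exactly the situation where a per-feature ranking understates how many "informative" columns the data-generating process used.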
Best Answer
Your result is correct: XGB recognizes that many of your features are not important and did not use them when building its decision trees. You can force XGB to use all of them by increasing the max tree depth setting, but you would be overfitting the data that way.
Back to your problem: only 84 features are used by XGB, so discarding the others produces a very similar result.
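I don't have the asker's data or XGB model at hand, so as a stand-in here is the same check sketched with scikit-learn's GradientBoostingClassifier (an assumed substitute, not the original XGB setup): count the features the boosted trees actually split on, then retrain on only those and compare scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Illustrative data: 50 features, only 5 informative.
X, y = make_classification(n_samples=1000, n_features=50,
                           n_informative=5, n_redundant=0,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Features with zero importance were never split on by any tree.
used = np.flatnonzero(gbm.feature_importances_)
full_score = gbm.score(X_te, y_te)

# Retrain on only the used features: the score should be very similar.
reduced = GradientBoostingClassifier(random_state=0).fit(X_tr[:, used], y_tr)
reduced_score = reduced.score(X_te[:, used], y_te)
print(len(used), "features used; scores:", full_score, "vs", reduced_score)
```

The same idea applies to XGB: features the booster never splits on contribute nothing, so dropping them leaves the model essentially unchanged.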