I have a classification problem with numeric features and a binary class label. Is the ANOVA F-value in Python (see here) a good technique for feature selection?
Solved – ANOVA F-value for feature selection
classification, feature selection, machine learning, python
Related Solutions
For feature selection, we need a scoring function as well as a search method to optimize the scoring function.
You may use RF as a feature-ranking method if you define a relevant importance score. RF draws samples at random with replacement and grows every tree on a separate subset of the features (called a random subspace). One importance score could be based on assigning the accuracy of a tree to every feature in that tree's random subspace, repeated for every tree. Since the subspaces are generated at random, you may set a threshold on how often a feature must appear before computing its importance score.
Summary:
Step 1: If feature X2 appears in at least 25% of the trees, score it. Otherwise, do not rank the feature, because we do not have sufficient information about its performance.
Step 2: Now, assign the performance score of every tree in which X2 appears to X2 and average the scores. For example: perf(Tree1) = 0.85, perf(Tree2) = 0.70, perf(Tree3) = 0.30.
Then, the importance of feature X2 = (0.85+0.70+0.30)/3 = 0.6167
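A minimal sketch of this scoring scheme, assuming scikit-learn's RandomForestClassifier; for simplicity it uses a held-out validation set for the per-tree score rather than the OOB samples, and the dataset and parameters are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the real problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# max_features < n_features makes each split draw from a random feature subset,
# playing the role of the random subspace described above.
rf = RandomForestClassifier(n_estimators=100, max_features=3, random_state=0)
rf.fit(X_tr, y_tr)

scores = [[] for _ in range(X.shape[1])]
for tree in rf.estimators_:
    acc = tree.score(X_val, y_val)        # per-tree performance (validation accuracy here, not OOB)
    used = np.unique(tree.tree_.feature)  # feature indices used in this tree's splits; -2 marks leaves
    for j in used[used >= 0]:
        scores[j].append(acc)

# Step 1: only score features appearing in at least 25% of the trees.
# Step 2: importance = average performance of the trees that use the feature.
threshold = 0.25 * rf.n_estimators
importance = {j: np.mean(s) for j, s in enumerate(scores) if len(s) >= threshold}
print(importance)
```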
You may consider a more advanced setting by including the split depth of the feature or its information-gain value in the decision tree. There are many ways to design a scoring function based on decision trees and RF.
Regarding the search method, your recursive method seems reasonable as a way to select the top-ranked features; see the sketch below.
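Your exact recursive procedure isn't shown, but scikit-learn's RFE (recursive feature elimination) is one standard realization of the idea; a minimal sketch, assuming an RF ranker and a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

# Recursively fit the forest, drop the weakest feature, and refit,
# until only the requested number of features remains.
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=3).fit(X, y)
print("selected mask:", rfe.support_)
print("elimination ranking:", rfe.ranking_)
```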
Finally, you may use RF either as a classifier or as a regression model when selecting your features, since both supply a performance score. The score is indicative because it is based on the out-of-bag (OOB) samples, so in a simpler setting you may skip cross-validation.
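For reference, a minimal sketch of obtaining that OOB score with scikit-learn (dataset and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True evaluates each sample only with the trees that did not see it
# during bootstrapping, giving a performance estimate without cross-validation.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy:", rf.oob_score_)
print("built-in impurity-based importances:", rf.feature_importances_)
```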
James et al. (2013), An Introduction to Statistical Learning strongly focus on cross-validation (CV) as a means to prevent overfitting. See also Hastie et al. (2009), The Elements of Statistical Learning.
Under certain circumstances, AIC and CV essentially do the same thing, but there are important cases where CV is more flexible.
The link points to the free e-version of the book, so I hope you will bear with me if I do not rehash their explanations here in what could only be an inferior way of doing so.
Best Answer
A late answer, but at least there is one!
In general yes, and in particular it depends! The F-value is a very good criterion for detecting the best individual variables for classification (I'll explain in a moment why I don't call them features ... wait for it!).
Why Individual?
The F-value is suited to variable ranking: it is applied to each variable in turn and tells you which one discriminates best between the classes. And it does that very well!
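A minimal sketch with scikit-learn, assuming the question refers to f_classif (its ANOVA F-test scorer); the dataset is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Score each variable individually by its ANOVA F-value against the class labels.
F, p = f_classif(X, y)
print("variables ranked by F-value:", np.argsort(F)[::-1])

# Keep only the k highest-scoring variables.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
X_selected = selector.transform(X)
```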
Variable vs Feature
This is not standard terminology, but I use it to make a point. Look at the figure below: the horizontal variable separates the classes better than the vertical one, so the F-value ranks it higher. But more than that, the horizontal variable is good enough for classification on its own, i.e., one of your original variables handles the classification task without any manipulation!
But look at the left side of the second figure. Neither variable is good enough for classification on its own, i.e., the discriminating feature is not in the set of your original variables, so the F-value does not tell you much. Here you need to construct a new feature that discriminates between the classes.
LDA
What if we write the F-value as a function of a linear projection of our data, so that a higher value of the function means a higher F-value for the projected feature? Then we can use optimization techniques to solve this maximization problem and find an axis that is not among our original variables, but lets us compute a new feature along which the F-value is maximized! This is shown on the right.
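This is what LDA does in the two-class case; a minimal sketch, assuming scikit-learn's LinearDiscriminantAnalysis and a synthetic two-variable dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import f_classif

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# F-values of the two original variables, taken one at a time.
print("original F-values:", f_classif(X, y)[0])

# LDA finds the projection axis maximizing the ratio of between-class to
# within-class variance, which is what the ANOVA F-value measures for two classes.
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
new_feature = lda.transform(X)
print("F-value of the LDA feature:", f_classif(new_feature, y)[0])
```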
Hope it helped! Good luck!