Feature Selection – Using SelectKBest for Feature Selection in Python with SciKit Learn

feature selection, machine learning, scikit learn, self-study

I am learning about feature selection in Python with scikit-learn. I came across the SelectKBest class, but it is unclear to me what kind of test is performed.

Select features according to the k highest scores.

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest

The only reference to a "k-score" I could find was on the following Wikipedia page.

Cohen's kappa coefficient is a statistic which measures inter-rater agreement for qualitative (categorical) items.

Does the k-score in the SelectKBest class refer to the above? Thank you.

Best Answer

No, SelectKBest works differently.

It takes as a parameter a score function, which must be applicable to a pair ($X$, $y$). The score function must return an array of scores, one for each feature $X[:, i]$ of $X$ (it may additionally return p-values, but these are not used for the selection itself). SelectKBest then simply retains the $k$ features of $X$ with the highest scores.
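As a minimal sketch of this mechanism, here is SelectKBest with a hypothetical score function (absolute Pearson correlation of each feature with $y$, which is not a scikit-learn built-in) on the Iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest

# Hypothetical score function for illustration: the absolute Pearson
# correlation of each feature X[:, i] with y. It returns one score per
# feature, which is all SelectKBest requires.
def abs_correlation(X, y):
    return np.array([abs(np.corrcoef(X[:, i], y)[0, 1])
                     for i in range(X.shape[1])])

X, y = load_iris(return_X_y=True)   # X has shape (150, 4)
select = SelectKBest(score_func=abs_correlation, k=2)
X_new = select.fit_transform(X, y)
print(X_new.shape)   # (150, 2): only the 2 highest-scoring features remain
```

Any function with this signature works the same way; the built-in score functions (chi2, f_classif, mutual_info_classif, ...) are just particular choices of it.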

So, for example, if you pass chi2 as a score function, SelectKBest will compute the chi2 statistic between each feature of $X$ and $y$ (assumed to be class labels). A small value suggests the feature is independent of $y$; a large value suggests it is non-randomly related to $y$, and so likely to provide important information. Only the $k$ highest-scoring features will be retained.
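For instance (a sketch using the Iris data, whose features are non-negative as chi2 requires):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)   # chi2 requires non-negative features
select = SelectKBest(score_func=chi2, k=2)
X_new = select.fit_transform(X, y)

print(select.scores_)        # one chi2 statistic per feature
print(select.get_support())  # boolean mask: which k features were kept
print(X_new.shape)           # (150, 2)
```

Inspecting scores_ and get_support() after fitting is a good habit: it shows exactly which features were kept and why, rather than treating the selection as a black box.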

Finally, SelectKBest has a default behaviour implemented, so you can write select = SelectKBest() and then call select.fit_transform(X, y) (in fact, I have seen people do this). In this case SelectKBest uses the f_classif score function. This interprets the values of $y$ as class labels and computes, for each feature $X[:, i]$ of $X$, an $F$-statistic. The formula used is exactly the one given here: one way ANOVA F-test, with $K$ the number of distinct values of $y$. A large score suggests that the means of the $K$ groups are not all equal. But this interpretation is valid only when some rather stringent conditions are met: for example, the values $X[:, i]$ must come from normally distributed populations, and the population variances of the $K$ groups must be equal. I see no reason why this should hold in practice, and without these assumptions the $F$-values are meaningless. So using SelectKBest() carelessly might throw out many features for the wrong reasons.
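The default behaviour can be verified directly: f_classif is the one-way ANOVA $F$-test, so for any single feature it agrees with scipy's f_oneway. A sketch on the breast cancer data (chosen because it has more than the default $k=10$ features):

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest

X, y = load_breast_cancer(return_X_y=True)  # 30 features, 2 classes
select = SelectKBest()          # defaults: score_func=f_classif, k=10
X_new = select.fit_transform(X, y)
print(X_new.shape)              # (569, 10)

# f_classif computes a one-way ANOVA F-statistic per feature; the same
# value for feature 0 can be obtained from scipy.stats.f_oneway by
# splitting that feature's values into the K class groups.
groups = [X[y == label, 0] for label in np.unique(y)]
F, _ = f_oneway(*groups)
print(np.isclose(select.scores_[0], F))  # True
```

Note that nothing here checks the normality or equal-variance assumptions the $F$-test rests on; the code runs silently regardless, which is exactly why the defaults are easy to misuse.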
