I am learning about feature selection using Python and scikit-learn. I came across the `SelectKBest` class, but it is unclear to me what kind of test is performed. Its documentation says:
> Select features according to the k highest scores.
The only reference to a "k-score" I found was on the following Wikipedia page:

> Cohen's kappa coefficient is a statistic which measures inter-rater agreement for qualitative (categorical) items.

Does the k-score in the `SelectKBest` class refer to the above? Thank you.
Best Answer
No, `SelectKBest` works differently. It takes as a parameter a score function, which must be applicable to a pair ($X$, $y$). The score function must return an array of scores, one for each feature $X[:, i]$ of $X$ (it may also return p-values, but these are neither needed nor used). `SelectKBest` then simply retains the $k$ features of $X$ with the highest scores.

So, for example, if you pass `chi2` as the score function, `SelectKBest` will compute the chi-squared statistic between each feature of $X$ and $y$ (which is assumed to hold class labels). A small value means the feature is independent of $y$; a large value means the feature is non-randomly related to $y$, and so likely to provide important information. Only the $k$ highest-scoring features are retained.
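As a minimal sketch of this (using scikit-learn's built-in iris data; the variable names are my own):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # 150 samples, 4 non-negative features

# Keep the k=2 features with the highest chi2 statistic w.r.t. the class labels
# (note that chi2 requires non-negative feature values)
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)

print(X.shape, X_new.shape)   # (150, 4) (150, 2)
print(selector.scores_)       # one chi2 score per original feature
```

After fitting, `selector.scores_` holds the per-feature statistics, so you can see exactly why each feature was kept or dropped.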
`SelectKBest` also has a default behaviour, so you can write `select = SelectKBest()` and then call `select.fit_transform(X, y)` (in fact I have seen people do this). In this case `SelectKBest` uses the `f_classif` score function, which interprets the values of $y$ as class labels and computes, for each feature $X[:, i]$ of $X$, an $F$-statistic. The formula used is exactly that of the one-way ANOVA $F$-test, with $K$ the number of distinct values of $y$. A large score suggests that the means of the $K$ groups are not all equal. On its own this is not very informative, and it is valid only when some rather stringent conditions are met: for example, the values $X[:, i]$ must come from normally distributed populations, and the population variances of the $K$ groups must be equal. I see no reason why this should hold in practice, and without these assumptions the $F$-values are meaningless. So using `SelectKBest()`
carelessly might throw out many features for the wrong reasons.
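You can verify the default for yourself by comparing the fitted scores against `f_classif` computed directly (a sketch on the iris data; `k=2` is an arbitrary choice of mine):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# No score function passed: SelectKBest falls back to f_classif
select = SelectKBest(k=2).fit(X, y)

# The same one-way ANOVA F-statistics, computed directly
F, pvals = f_classif(X, y)
print(np.allclose(select.scores_, F))  # True
```

So if your features are plausibly non-normal or heteroscedastic across classes, pass an explicit score function rather than relying on the default.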