Solved – SelectKBest score function with mixed categorical and continuous data

feature-selection, python, scikit-learn, variable

I am building a classification model where my label is categorical (0 or 1). I want to use scikit-learn’s SelectKBest to select my top 10 features, but I’m not sure which score function to use. I thought I’d use chi2, but not all my variables are categorical. Which function works best with mixed variables (categorical, continuous, discrete)? I’ve seen several posts where people use f_classif, but isn’t ANOVA only valid if my label is continuous and predictor variables are categorical? I’m trying to find a score function that can handle all of my variables.

Best Answer

Try the mutual_info_classif score function. It works with both continuous and discrete variables. You can pass a boolean mask or an array of indices of the discrete features via the discrete_features parameter:

>>> from functools import partial
>>> from sklearn.feature_selection import mutual_info_classif, SelectKBest
>>> discrete_feat_idx = [1, 3]  # indices of the discrete features
>>> score_func = partial(mutual_info_classif, discrete_features=discrete_feat_idx)
>>> s = SelectKBest(score_func)

But note that discrete does not always imply categorical: if a feature is discrete but its values are not meaningfully comparable (e.g., arbitrary integer codes), I suspect the corresponding score will not make much sense.
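To make the pieces above concrete, here is a minimal end-to-end sketch on synthetic data. The data, the discrete column indices [1, 3], and k=2 are all arbitrary choices for illustration:

```python
import numpy as np
from functools import partial
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.RandomState(0)

# 100 samples, 5 features; columns 1 and 3 are integer-coded (discrete)
X = rng.rand(100, 5)
X[:, 1] = rng.randint(0, 3, size=100)
X[:, 3] = rng.randint(0, 5, size=100)
y = rng.randint(0, 2, size=100)  # binary label, as in the question

# Bind the discrete-feature indices (and a seed for reproducibility)
# into the score function before handing it to SelectKBest
score_func = partial(mutual_info_classif,
                     discrete_features=[1, 3], random_state=0)

selector = SelectKBest(score_func, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (100, 2): only the top-2 scoring features remain
```

SelectKBest only ever calls score_func(X, y), which is why partial is needed here: it is the standard way to fix extra keyword arguments such as discrete_features ahead of time.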
