I am building a classification model where my label is categorical (0 or 1). I want to use scikit-learn’s SelectKBest to select my top 10 features, but I’m not sure which score function to use. I thought I’d use chi2, but not all my variables are categorical. Which function works best with mixed variables (categorical, continuous, discrete)? I’ve seen several posts where people use f_classif, but isn’t ANOVA only valid if my label is continuous and predictor variables are categorical? I’m trying to find a score function that can handle all of my variables.
Solved – SelectKBest score function with mixed categorical and continuous data
feature-selection, python, scikit-learn, variable
Related Solutions
I think part of your confusion is about which types of variables a chi-squared test can compare. Wikipedia says the following about it:
It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution.
Thus it compares frequency distributions, also known as counts, also known as non-negative numbers. The different frequency distributions are defined by the categorical variable; i.e., for each value of the categorical variable there needs to be a frequency distribution that can be compared to the other ones.
There are several ways to get the frequency distributions. One is from a second categorical variable, whose co-occurrences with the first categorical variable are counted to give a discrete frequency distribution. Another option is to use one or more numerical variables: for each value of the categorical variable one can (e.g.) sum the values of the numerical variable. In fact, if the categorical variable is binarised, the former is a special case of the latter.
Example
As an example, look at these two variables:
x = ['mouse', 'cat', 'mouse', 'cat']
z = ['wild', 'domesticated', 'domesticated', 'domesticated']
The categorical variables x and z can be compared by counting their co-occurrences, and this is what happens in a chi-squared test:
                'mouse'  'cat'
'wild'             1       0
'domesticated'     1       2
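This table can be reproduced programmatically. A minimal sketch, assuming pandas is available (pandas.crosstab is not part of the original answer, just an illustration):

import pandas as pd

x = ['mouse', 'cat', 'mouse', 'cat']
z = ['wild', 'domesticated', 'domesticated', 'domesticated']

# Count co-occurrences of the two categorical variables
print(pd.crosstab(pd.Series(z, name='z'), pd.Series(x, name='x')))
# x             cat  mouse
# z
# domesticated    2      1
# wild            0      1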
However, you can also binarise the values of 'x' and get the following variables:
x1 = [1, 0, 1, 0]
x2 = [0, 1, 0, 1]
z = ['wild', 'domesticated', 'domesticated', 'domesticated']
Counting the co-occurrences is now equivalent to summing the values of x1 and x2 that correspond to each value of z.
                x1  x2
'wild'           1   0
'domesticated'   1   2
As you can see, a single categorical variable (x) or multiple numerical variables (x1 and x2) are represented equally in the contingency table. Thus a chi-squared test can be applied to a categorical variable (the label in sklearn) combined with either another categorical variable or multiple numerical variables (the features in sklearn).
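To make this concrete, here is a minimal sketch, assuming numpy and scikit-learn are installed; it reproduces the contingency table by summing the binarised columns per value of z, and then feeds the same features to sklearn's chi2:

import numpy as np
from sklearn.feature_selection import chi2

# Binarised versions of x: x1 ('mouse') and x2 ('cat')
X = np.array([[1, 0],
              [0, 1],
              [1, 0],
              [0, 1]])
z = np.array(['wild', 'domesticated', 'domesticated', 'domesticated'])

# Summing each column per label reproduces the contingency table above
for label in np.unique(z):
    print(label, X[z == label].sum(axis=0))
# domesticated [1 2]
# wild [1 0]

# chi2 accepts non-negative numerical features and a categorical label
scores, p_values = chi2(X, z)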
The plot doesn't look wrong. Your X axis is the word count of one word, after scaling; the Y axis is age. The vertical stacks arise because word counts are always integers; there are 8 stacks, corresponding to word counts of 0-7. The blue trend line shows that this word is a weak positive indicator of age.
The plot would be slightly clearer if you did not scale your input. Linear regression doesn't benefit from unit-variance scaling anyway.
Best Answer
Try the mutual_info_classif scoring function. It works with both continuous and discrete variables; you can specify a mask or the indices of the discrete features in its discrete_features parameter. But note that discrete does not always imply categorical, so if a feature is a discrete variable whose values are not meaningfully comparable, I suspect the corresponding score will not make much sense.
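For illustration, a minimal sketch under made-up assumptions (the random feature matrix and the choice of discrete columns are invented; k=10 matches the question). Since discrete_features is an argument of the score function rather than of SelectKBest, one way to pass it through is functools.partial:

import numpy as np
from functools import partial
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.RandomState(0)
X = rng.rand(100, 20)            # hypothetical mixed feature matrix
X[:, :5] = (X[:, :5] > 0.5)      # pretend the first 5 columns are discrete
y = rng.randint(0, 2, size=100)  # binary label, as in the question

# Bind the discrete-feature indices to the score function, then keep the top 10
score_func = partial(mutual_info_classif, discrete_features=[0, 1, 2, 3, 4])
selector = SelectKBest(score_func, k=10)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (100, 10)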