I am building a classification model where my label is categorical (0 or 1). I want to use scikit-learn’s SelectKBest to select my top 10 features, but I’m not sure which score function to use. I thought I’d use chi2, but not all my variables are categorical. Which function works best with mixed variables (categorical, continuous, discrete)? I’ve seen several posts where people use f_classif, but isn’t ANOVA only valid if my label is continuous and predictor variables are categorical? I’m trying to find a score function that can handle all of my variables.
Solved – SelectKBest score function with mixed categorical and continuous data
feature-selection, python, scikit-learn, variable
Related Solutions
I think part of your confusion is about which types of variables a chi-squared test can compare. Wikipedia says the following about it:
It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution.
Thus it compares frequency distributions, also known as counts, also known as non-negative numbers. The different frequency distributions are defined by the categorical variable; i.e., for each value of the categorical variable there needs to be a frequency distribution that can be compared to the other ones.
There are several ways to get the frequency distributions. One is from a second categorical variable, whose co-occurrences with the first categorical variable are counted to give a discrete frequency distribution. Another option is to use one or more numerical variables: for each value of the categorical variable one can (e.g.) sum the values of the numerical variable. In fact, if the categorical variable is binarised, the former is a special case of the latter.
Example
As an example, look at these two variables:
x = ['mouse', 'cat', 'mouse', 'cat']
z = ['wild', 'domesticated', 'domesticated', 'domesticated']
The categorical variables x and z can be compared by counting their co-occurrences, and this is what happens in a chi-squared test:
                'mouse'  'cat'
'wild'             1       0
'domesticated'     1       2
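This table can be reproduced programmatically. A minimal sketch, assuming pandas is available (pandas.crosstab is not part of the original answer, just an illustration):

import pandas as pd

x = ['mouse', 'cat', 'mouse', 'cat']
z = ['wild', 'domesticated', 'domesticated', 'domesticated']

# Count co-occurrences of the two categorical variables
print(pd.crosstab(pd.Series(z, name='z'), pd.Series(x, name='x')))
# x             cat  mouse
# z
# domesticated    2      1
# wild            0      1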
However, you can also binarise the values of 'x' and get the following variables:
x1 = [1, 0, 1, 0]
x2 = [0, 1, 0, 1]
z = ['wild', 'domesticated', 'domesticated', 'domesticated']
Counting the co-occurrences is now equivalent to summing the values of x1 and x2 that correspond to each value of z.
                x1  x2
'wild'           1   0
'domesticated'   1   2
As you can see, a single categorical variable (x) or multiple numerical variables (x1 and x2) are represented equally in the contingency table. Thus a chi-squared test can be applied to a categorical variable (the label in sklearn) combined with either another categorical variable or multiple numerical variables (the features in sklearn).
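To make this concrete, here is a minimal sketch, assuming numpy and scikit-learn are installed; it reproduces the contingency table by summing the binarised columns per value of z, and then feeds the same features to sklearn's chi2:

import numpy as np
from sklearn.feature_selection import chi2

# Binarised versions of x: x1 ('mouse') and x2 ('cat')
X = np.array([[1, 0],
              [0, 1],
              [1, 0],
              [0, 1]])
z = np.array(['wild', 'domesticated', 'domesticated', 'domesticated'])

# Summing each column per label reproduces the contingency table above
for label in np.unique(z):
    print(label, X[z == label].sum(axis=0))
# domesticated [1 2]
# wild [1 0]

# chi2 accepts non-negative numerical features and a categorical label
scores, p_values = chi2(X, z)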
The plot doesn't look wrong. Your X axis is the word count of one word, after scaling; the Y axis is age. The vertical stacks arise because word counts are always integers; there are 8 stacks, corresponding to word counts of 0-7. The blue trend line shows that this word is a weak positive indicator of age.
The plot would be slightly clearer if you did not scale your input. Linear regression doesn't benefit from unit-variance scaling anyway.
Best Answer
Try the mutual_info_classif scoring function. It works with both continuous and discrete variables; you can specify a mask or the indices of the discrete features in its discrete_features parameter. But note that discrete does not always imply categorical, so if a feature is a discrete variable whose values are not meaningfully comparable, I suspect the corresponding score will not make much sense.
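For illustration, a minimal sketch under made-up assumptions (the random feature matrix and the choice of discrete columns are invented; k=10 matches the question). Since discrete_features is an argument of the score function rather than of SelectKBest, one way to pass it through is functools.partial:

import numpy as np
from functools import partial
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.RandomState(0)
X = rng.rand(100, 20)            # hypothetical mixed feature matrix
X[:, :5] = (X[:, :5] > 0.5)      # pretend the first 5 columns are discrete
y = rng.randint(0, 2, size=100)  # binary label, as in the question

# Bind the discrete-feature indices to the score function, then keep the top 10
score_func = partial(mutual_info_classif, discrete_features=[0, 1, 2, 3, 4])
selector = SelectKBest(score_func, k=10)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (100, 10)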