Solved – How does scikit-learn perform $\chi^2$ feature selection on non-categorical features

Tags: feature selection, svm, text mining

I'm experimenting with $\chi^2$ feature selection for some text classification tasks. I understand that the $\chi^2$ test checks for dependence between two categorical variables, so if we perform $\chi^2$ feature selection for a binary text classification problem with a binary bag-of-words (BOW) vector representation, each $\chi^2$ test on each (feature, class) pair is a straightforward $\chi^2$ test with 1 degree of freedom.
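For a single (feature, class) pair with binary values, that 1-degree-of-freedom test can be sketched directly with `scipy.stats.chi2_contingency`. The `term_present` and `label` arrays below are made-up toy data for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical binary BOW data: does term t appear in the document (1) or not (0)?
term_present = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
label        = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# 2x2 contingency table of (term presence) x (class)
table = np.array([
    [np.sum((term_present == 1) & (label == 1)),   # present, positive
     np.sum((term_present == 1) & (label == 0))],  # present, negative
    [np.sum((term_present == 0) & (label == 1)),   # absent, positive
     np.sum((term_present == 0) & (label == 0))],  # absent, negative
])

stat, p, dof, expected = chi2_contingency(table, correction=False)
print(dof)  # 1 degree of freedom for a 2x2 table
```

Note this classical contingency-table test uses both the "present" and "absent" rows; as discussed below, sklearn's `chi2` works a bit differently.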

Quoting from the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2,

This score can be used to select the n_features features with the
highest values for the χ² (chi-square) statistic from X, which must
contain booleans or frequencies (e.g., term counts in document
classification), relative to the classes.

It seems to me that we can also perform $\chi^2$ feature selection on a DF (word count) vector representation.

My first question is: how does sklearn discretize integer-valued features into categorical ones?

My second question is similar to the first. From the demo code here: http://scikit-learn.sourceforge.net/dev/auto_examples/document_classification_20newsgroups.html

It seems to me that we can also perform $\chi^2$ feature selection on a TF-IDF vector representation.

My second question is: how does sklearn perform $\chi^2$ feature selection on real-valued features?
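In practice, `sklearn.feature_selection.chi2` accepts any non-negative feature matrix, including real-valued TF-IDF weights. A minimal sketch of the usual pipeline, on a made-up toy corpus (the documents and labels below are invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy corpus for a binary classification task
docs = ["the cat sat on the mat",
        "dogs chase the cat",
        "stock prices rose sharply",
        "the market fell on bad news"]
y = np.array([0, 0, 1, 1])

# TF-IDF values are real-valued and non-negative, which is all chi2 requires
X = TfidfVectorizer().fit_transform(docs)

# Keep the 5 features with the highest chi2 scores
selector = SelectKBest(chi2, k=5).fit(X, y)
X_new = selector.transform(X)
print(X_new.shape)  # (4, 5)
```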

Best Answer

Found the answer here: https://stackoverflow.com/questions/14573030/perform-chi-2-feature-selection-on-tf-and-tfidf-vectors

Think of the null hypothesis as "the document class has no influence over the feature's frequency". In other words, sklearn does not discretize anything: it treats the feature values themselves as frequencies, sums them per class to get the observed counts, and compares those against expected counts derived from the class proportions.
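That description can be checked numerically. Below is a sketch of the computation `sklearn.feature_selection.chi2` performs, as I understand its implementation, on a made-up count matrix; the manual scores should match sklearn's:

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical term-count matrix: 4 documents x 3 features, binary labels
X = np.array([[1, 0, 3],
              [2, 1, 0],
              [0, 2, 1],
              [1, 3, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

# observed[c, j] = total count of feature j over documents of class c
Y = np.stack([(y == c).astype(float) for c in (0, 1)])  # one-hot classes, shape (2, 4)
observed = Y @ X                                        # shape (2, 3)

# expected[c, j] = P(class c) * total count of feature j
class_prob = Y.sum(axis=1) / len(y)
feature_count = X.sum(axis=0)
expected = np.outer(class_prob, feature_count)

# Chi-squared statistic per feature, summed over classes
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sklearn_scores, _ = chi2(X, y)
print(np.allclose(manual_scores, sklearn_scores))  # True
```

Note that only the "feature occurs" counts enter the statistic (there is no "feature absent" row), so for binary features this is not identical to the full 2x2 contingency-table test.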