Solved – How does scikit-learn perform $\chi^2$ feature selection on non-categorical features

Tags: feature selection, svm, text mining

I'm experimenting with $\chi^2$ feature selection for some text classification tasks. I understand that the $\chi^2$ test checks for dependence between two categorical variables, so if we perform $\chi^2$ feature selection for a binary text classification problem with a binary bag-of-words (BOW) vector representation, each $\chi^2$ test on each (feature, class) pair is a straightforward $\chi^2$ test with 1 degree of freedom.
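For a single (feature, class) pair with binary values, that 1-degree-of-freedom test can be sketched directly with `scipy.stats.chi2_contingency`. The `term_present` and `label` arrays below are made-up toy data for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical binary BOW data: does term t appear in the document (1) or not (0)?
term_present = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
label        = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# 2x2 contingency table of (term presence) x (class)
table = np.array([
    [np.sum((term_present == 1) & (label == 1)),   # present, positive
     np.sum((term_present == 1) & (label == 0))],  # present, negative
    [np.sum((term_present == 0) & (label == 1)),   # absent, positive
     np.sum((term_present == 0) & (label == 0))],  # absent, negative
])

stat, p, dof, expected = chi2_contingency(table, correction=False)
print(dof)  # 1 degree of freedom for a 2x2 table
```

Note this classical contingency-table test uses both the "present" and "absent" rows; as discussed below, sklearn's `chi2` works a bit differently.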

Quoting from the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2,

This score can be used to select the n_features features with the
highest values for the χ² (chi-square) statistic from X, which must
contain booleans or frequencies (e.g., term counts in document
classification), relative to the classes.

It seems to me that we can also perform $\chi^2$ feature selection on a DF (word count) vector representation.

My first question is: how does sklearn discretize integer-valued features into categorical ones?

My second question is similar to the first. From the demo code here: http://scikit-learn.sourceforge.net/dev/auto_examples/document_classification_20newsgroups.html

It seems to me that we can also perform $\chi^2$ feature selection on a TF-IDF vector representation.

My second question is: how does sklearn perform $\chi^2$ feature selection on real-valued features?
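In practice, `sklearn.feature_selection.chi2` accepts any non-negative feature matrix, including real-valued TF-IDF weights. A minimal sketch of the usual pipeline, on a made-up toy corpus (the documents and labels below are invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy corpus for a binary classification task
docs = ["the cat sat on the mat",
        "dogs chase the cat",
        "stock prices rose sharply",
        "the market fell on bad news"]
y = np.array([0, 0, 1, 1])

# TF-IDF values are real-valued and non-negative, which is all chi2 requires
X = TfidfVectorizer().fit_transform(docs)

# Keep the 5 features with the highest chi2 scores
selector = SelectKBest(chi2, k=5).fit(X, y)
X_new = selector.transform(X)
print(X_new.shape)  # (4, 5)
```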

Best Answer

Found the answer here: https://stackoverflow.com/questions/14573030/perform-chi-2-feature-selection-on-tf-and-tfidf-vectors

Think of the null hypothesis as "the document class has no influence over the feature's frequency". In other words, sklearn does not discretize anything: it treats the feature values themselves as frequencies, sums them per class to get the observed counts, and compares those against expected counts derived from the class proportions.
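That description can be checked numerically. Below is a sketch of the computation `sklearn.feature_selection.chi2` performs, as I understand its implementation, on a made-up count matrix; the manual scores should match sklearn's:

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical term-count matrix: 4 documents x 3 features, binary labels
X = np.array([[1, 0, 3],
              [2, 1, 0],
              [0, 2, 1],
              [1, 3, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

# observed[c, j] = total count of feature j over documents of class c
Y = np.stack([(y == c).astype(float) for c in (0, 1)])  # one-hot classes, shape (2, 4)
observed = Y @ X                                        # shape (2, 3)

# expected[c, j] = P(class c) * total count of feature j
class_prob = Y.sum(axis=1) / len(y)
feature_count = X.sum(axis=0)
expected = np.outer(class_prob, feature_count)

# Chi-squared statistic per feature, summed over classes
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sklearn_scores, _ = chi2(X, y)
print(np.allclose(manual_scores, sklearn_scores))  # True
```

Note that only the "feature occurs" counts enter the statistic (there is no "feature absent" row), so for binary features this is not identical to the full 2x2 contingency-table test.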