Solved – What kind of feature selection can the chi-squared test be used for

chi-squared-test, feature selection, hypothesis testing, independence, scikit learn

  1. Here I am asking what others commonly do when using the chi-squared
    test for feature selection with respect to the outcome in supervised
    learning. If I understand correctly, do they test the independence
    between each feature and the outcome, and then compare the p-values
    across the per-feature tests?

  2. In http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test,

    Pearson's chi-squared test is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed
    difference between the sets arose by chance.

    A test of independence assesses whether paired observations on two variables, expressed in a contingency table, are independent of each
    other (e.g. polling responses from people of different nationalities
    to see if one's nationality is related to the response).

    So must the two variables whose independence the test assesses be
    categorical, or at least discrete (ordered as well as categorical),
    but not continuous?

  3. From http://scikit-learn.org/stable/modules/feature_selection.html,
    they

    perform a $\chi^2$ test to the iris dataset to retrieve only the two best features.

    In the iris dataset, all the features are numerical and
    continuous-valued, and the outcome is class labels (categorical).
    How does the chi-squared independence test apply to continuous
    features?

    To apply the chi-squared independence test to the dataset, do we
    first convert the continuous features into discrete ones by binning,
    i.e. partitioning each feature's continuous domain into bins and then
    replacing the feature with the counts of its values falling into the
    bins? (See the first sketch after this list.)

    The occurrences across several bins form a multinomial variable (a value either falls in a given bin or it does not), so the chi-squared independence test can apply to them, right?

    By the way, I guess we can apply the chi-squared independence test to features and outcomes of any kind this way, correct?

    For the outcome part, we can then use the chi-squared independence test to select features not only for classification but also for regression, by binning the continuous outcome, right?

  4. The scikit learn site also says

    Compute chi-squared stats between each non-negative feature and class.

    This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must
    contain only non-negative features such as booleans or frequencies
    (e.g., term counts in document classification), relative to the
    classes.

    Why does the test require non-negative features?

    If the features have no sign at all, because they are categorical or
    discrete, can the test still apply to them? (See my part 1.)

    If the features are negative, we can always bin their domains and
    replace them with their occurrences (just like what I guess for
    applying the test to the iris dataset, see part 3), right? (See the
    second sketch below.)
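
To make parts 1 and 3 concrete, here is a minimal sketch of the procedure I have in mind: bin each continuous iris feature, cross-tabulate the bins against the class label, and run a chi-squared test of independence per feature, comparing the resulting p-values. The choice of four equal-width bins is arbitrary and only for illustration.

    import pandas as pd
    from scipy.stats import chi2_contingency
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)

    # For each continuous feature: discretize it, build the bins-by-classes
    # contingency table, and test independence of bins and classes.
    for j in range(X.shape[1]):
        binned = pd.cut(X[:, j], bins=4, labels=False)  # equal-width bins
        table = pd.crosstab(binned, y)                  # counts: bins x classes
        stat, p, dof, expected = chi2_contingency(table)
        print(f"feature {j}: chi2 = {stat:.1f}, p = {p:.3g}")

The same pd.cut step could presumably be applied to a continuous outcome, which is the regression case I ask about in part 3.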

Note: I guess Scikit Learn follows general principles, and those general principles are what I am asking about here. If not, that is still all right.
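
For part 4, the scikit-learn usage I am referring to is the standard SelectKBest route shown below (the iris features are all non-negative, so chi2 accepts them directly; whether that is statistically sensible for continuous values is exactly what I am asking).

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)

    # Score each (non-negative) feature against the class labels and keep
    # the two features with the highest chi-squared statistics.
    selector = SelectKBest(chi2, k=2)
    X_new = selector.fit_transform(X, y)
    print(selector.scores_)  # one chi-squared score per feature
    print(X_new.shape)       # (150, 2)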

Best Answer

I think part of your confusion is about which types of variables a chi-squared test can compare. Wikipedia says the following about this:

It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution.

Thus it compares frequency distributions, also known as counts, which are non-negative numbers. The different frequency distributions are defined by the categorical variable; i.e. for each value of the categorical variable there needs to be a frequency distribution that can be compared to the other ones.

There are several ways to get the frequency distributions. One is from a second categorical variable, whose co-occurrences with the first categorical variable are counted to give a discrete frequency distribution. Another option is to use one or more numerical variables: for each value of the categorical variable, the test can (e.g.) sum the values of the numerical variable. In fact, if the categorical variable is binarised, the former is a special case of the latter.

Example

As an example look at these sets of variables:

x = ['mouse', 'cat', 'mouse', 'cat']
z = ['wild', 'domesticated', 'domesticated', 'domesticated']

The categorical variables x and z can be compared by counting the co-occurrences, and this is what happens with a chi-squared test:

                 'mouse'    'cat'
'wild'              1         0
'domesticated'      1         2
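
Running the test on this table is straightforward; here is a minimal sketch using scipy's standard contingency-table routine (the counts are far too small for a meaningful test, this only shows the mechanics):

    from scipy.stats import chi2_contingency

    # The wild/domesticated vs. mouse/cat counts from the table above.
    table = [[1, 0],
             [1, 2]]
    stat, p, dof, expected = chi2_contingency(table)
    print(stat, p)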

However, you can also binarise the values of 'x' and get the following variables:

x1 = [1, 0, 1, 0]
x2 = [0, 1, 0, 1]
z = ['wild', 'domesticated', 'domesticated', 'domesticated']

Counting the co-occurrences is now equivalent to summing, for each value of z, the corresponding values of x1 and x2:

                 x1    x2
'wild'           1     0
'domesticated'   1     2

As you can see, a single categorical variable (x) and multiple numerical variables (x1 and x2) are represented equally well in the contingency table. Thus chi-squared tests can be applied to a categorical variable (the label in sklearn) combined with either another categorical variable or multiple numerical variables (the features in sklearn).
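
To see this in code: sklearn's chi2 scorer consumes exactly the binarised feature-matrix form, summing each feature column per class to build the table. A sketch follows; note that it returns one statistic per feature rather than a single joint test on the whole table, so the numbers differ from a chi2_contingency call on the full table.

    import numpy as np
    from sklearn.feature_selection import chi2

    # Binarised animal variable (x1 = 'mouse', x2 = 'cat'), one row per
    # observation, and the corresponding labels.
    X = np.array([[1, 0],
                  [0, 1],
                  [1, 0],
                  [0, 1]])
    z = ['wild', 'domesticated', 'domesticated', 'domesticated']

    scores, pvalues = chi2(X, z)  # one (statistic, p-value) pair per column
    print(scores, pvalues)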
