Solved – What kind of feature selection can the chi-squared test be used for

chi-squared-test, feature selection, hypothesis testing, independence, scikit learn

  1. Here I am asking what others commonly do when using the chi-squared
    test for feature selection with respect to the outcome in supervised
    learning. If I understand correctly, do they test the independence
    between each feature and the outcome, and then compare the p-values
    across the per-feature tests?

  2. In http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test,

    Pearson's chi-squared test is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed
    difference between the sets arose by chance.

    A test of independence assesses whether paired observations on two variables, expressed in a contingency table, are independent of each
    other (e.g. polling responses from people of different nationalities
    to see if one's nationality is related to the response).

    So must the two variables whose independence the test assesses be
    categorical, or at least discrete (ordered as well as categorical),
    but not continuous?

  3. From http://scikit-learn.org/stable/modules/feature_selection.html,
    they

    perform a $\chi^2$ test to the iris dataset to retrieve only the two best features.

    In the iris dataset, all the features are numerical and
    continuous-valued, and the outcome is class labels (categorical).
    How does the chi-squared independence test apply to continuous
    features?

    To apply the chi-squared independence test to the dataset, do we
    first convert the continuous features into discrete ones by binning,
    i.e. partitioning each feature's continuous domain into bins and then
    replacing the feature with the counts of its values falling into the
    bins? (See the first sketch after this list.)

    The occurrences across several bins form a multinomial variable (a value either falls in a given bin or it does not), so the chi-squared independence test can apply to them, right?

    By the way, I guess we can apply the chi-squared independence test to features and outcomes of any kind this way, correct?

    For the outcome part, we can then use the chi-squared independence test to select features not only for classification but also for regression, by binning the continuous outcome, right?

  4. The scikit learn site also says

    Compute chi-squared stats between each non-negative feature and class.

    This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must
    contain only non-negative features such as booleans or frequencies
    (e.g., term counts in document classification), relative to the
    classes.

    Why does the test require non-negative features?

    If the features have no sign at all, because they are categorical or
    discrete, can the test still apply to them? (See my part 1.)

    If the features are negative, we can always bin their domains and
    replace them with their occurrences (just like what I guess for
    applying the test to the iris dataset, see part 3), right? (See the
    second sketch below.)
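
To make parts 1 and 3 concrete, here is a minimal sketch of the procedure I have in mind: bin each continuous iris feature, cross-tabulate the bins against the class label, and run a chi-squared test of independence per feature, comparing the resulting p-values. The choice of four equal-width bins is arbitrary and only for illustration.

    import pandas as pd
    from scipy.stats import chi2_contingency
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)

    # For each continuous feature: discretize it, build the bins-by-classes
    # contingency table, and test independence of bins and classes.
    for j in range(X.shape[1]):
        binned = pd.cut(X[:, j], bins=4, labels=False)  # equal-width bins
        table = pd.crosstab(binned, y)                  # counts: bins x classes
        stat, p, dof, expected = chi2_contingency(table)
        print(f"feature {j}: chi2 = {stat:.1f}, p = {p:.3g}")

The same pd.cut step could presumably be applied to a continuous outcome, which is the regression case I ask about in part 3.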

Note: I guess Scikit Learn follows general principles, and those general principles are what I am asking about here. If not, that is still all right.
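
For part 4, the scikit-learn usage I am referring to is the standard SelectKBest route shown below (the iris features are all non-negative, so chi2 accepts them directly; whether that is statistically sensible for continuous values is exactly what I am asking).

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)

    # Score each (non-negative) feature against the class labels and keep
    # the two features with the highest chi-squared statistics.
    selector = SelectKBest(chi2, k=2)
    X_new = selector.fit_transform(X, y)
    print(selector.scores_)  # one chi-squared score per feature
    print(X_new.shape)       # (150, 2)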

Best Answer

I think part of your confusion is about which types of variables a chi-squared test can compare. Wikipedia says the following about this:

It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution.

Thus it compares frequency distributions, also known as counts, which are non-negative numbers. The different frequency distributions are defined by the categorical variable; i.e. for each value of the categorical variable there needs to be a frequency distribution that can be compared to the other ones.

There are several ways to get the frequency distributions. One is from a second categorical variable, whose co-occurrences with the first categorical variable are counted to give a discrete frequency distribution. Another option is to use one or more numerical variables: for each value of the categorical variable, the test can (e.g.) sum the values of the numerical variable. In fact, if the categorical variable is binarised, the former is a special case of the latter.

Example

As an example look at these sets of variables:

x = ['mouse', 'cat', 'mouse', 'cat']
z = ['wild', 'domesticated', 'domesticated', 'domesticated']

The categorical variables x and z can be compared by counting the co-occurrences, and this is what happens with a chi-squared test:

                 'mouse'    'cat'
'wild'              1         0
'domesticated'      1         2
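
Running the test on this table is straightforward; here is a minimal sketch using scipy's standard contingency-table routine (the counts are far too small for a meaningful test, this only shows the mechanics):

    from scipy.stats import chi2_contingency

    # The wild/domesticated vs. mouse/cat counts from the table above.
    table = [[1, 0],
             [1, 2]]
    stat, p, dof, expected = chi2_contingency(table)
    print(stat, p)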

However, you can also binarise the values of 'x' and get the following variables:

x1 = [1, 0, 1, 0]
x2 = [0, 1, 0, 1]
z = ['wild', 'domesticated', 'domesticated', 'domesticated']

Counting the co-occurrences is now equivalent to summing, for each value of z, the corresponding values of x1 and x2:

                 x1    x2
'wild'           1     0
'domesticated'   1     2

As you can see, a single categorical variable (x) and multiple numerical variables (x1 and x2) are represented equally well in the contingency table. Thus chi-squared tests can be applied to a categorical variable (the label in sklearn) combined with either another categorical variable or multiple numerical variables (the features in sklearn).
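
To see this in code: sklearn's chi2 scorer consumes exactly the binarised feature-matrix form, summing each feature column per class to build the table. A sketch follows; note that it returns one statistic per feature rather than a single joint test on the whole table, so the numbers differ from a chi2_contingency call on the full table.

    import numpy as np
    from sklearn.feature_selection import chi2

    # Binarised animal variable (x1 = 'mouse', x2 = 'cat'), one row per
    # observation, and the corresponding labels.
    X = np.array([[1, 0],
                  [0, 1],
                  [1, 0],
                  [0, 1]])
    z = ['wild', 'domesticated', 'domesticated', 'domesticated']

    scores, pvalues = chi2(X, z)  # one (statistic, p-value) pair per column
    print(scores, pvalues)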
