Solved – Feature selection using chi-squared for continuous features

Tags: chi-squared-test, continuous data, feature selection, scikit-learn

I'm looking at univariate feature selection. A method that is often described is to look at the p-values of a $\chi^2$ test. However, I'm confused as to how this works for continuous variables.

1. How can the $\chi^2$ test work for feature selection for continuous variables?
I have always thought this test works on counts. It appears to me that you have to bin the data in some way, but then the outcome depends on the binning you choose. I'm also interested in how this works for a mix of continuous and categorical variables.

2. Is it a problem that this test is scale dependent?
My second concern is that the test is scale dependent. This is not a problem for counts, which are unitless, but it can have a great impact on feature selection for continuous variables measured in physical units (see Example).

Example

Showing the test is scale-dependent for variables with units of measurement:

Let's look at the original example from: http://scikit-learn.org/stable/modules/feature_selection.html

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
selector = SelectKBest(chi2, k=2)  # keep the 2 features with the highest chi2 score
selector.fit(X, y)
print(selector.pvalues_)       # one p-value per feature
print(selector.get_support())  # boolean mask of the selected features

Output:

[4.47e-03 1.657e-01 5.94e-26 2.50e-15]
[False False True True]

Now let's imagine we had recorded the first and third columns not in cm but in mm. Obviously, this doesn't change how the class depends on sepal length and petal length.
However, the p-values change dramatically, and accordingly, the selected columns change:

X[:, 0] = 10*X[:, 0]  # sepal length: cm -> mm
X[:, 2] = 10*X[:, 2]  # petal length: cm -> mm
selector.fit(X, y)
print(selector.pvalues_)
print(selector.get_support())

Output:

[3.23e-024 1.66e-001 5.50e-253 2.50e-015]
[True False True False]

If I had also recorded the 2nd column in mm instead of cm, it too would have received a significant p-value.
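A quick check, continuing the snippet above (as argued below, the $\chi^2$ statistic scales linearly with the feature values, so multiplying a column by 10 multiplies its statistic by 10):

X[:, 1] = 10*X[:, 1]  # hypothetically record sepal width in mm as well
selector.fit(X, y)
print(selector.pvalues_[1])  # now far below 0.05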

I believe this has to do with the fact that the method does not do any binning: for each feature it sums the values per class and compares those sums to the sums expected under independence. Additionally, since the numerator of the $\chi^2$ statistic, $(O - E)^2$, is squared while the denominator $E$ is not, the statistic scales linearly with the values, so the p-values depend on the units of measurement.
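To make this concrete, here is a minimal sketch of the computation as I understand scikit-learn's chi2 implements it: the per-class sums of each feature are treated as the "observed" counts, the expected counts come from the class proportions, and the test uses n_classes - 1 degrees of freedom. The helper name reproduce_chi2 is mine:

import numpy as np
from scipy.stats import chi2 as chi2_dist
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

def reproduce_chi2(X, y):
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)  # one-hot classes, (n_samples, n_classes)
    observed = Y.T @ X                               # per-class sums of the feature values
    class_prob = Y.mean(axis=0)                      # relative class frequencies
    expected = np.outer(class_prob, X.sum(axis=0))   # sums expected under independence
    stat = ((observed - expected) ** 2 / expected).sum(axis=0)
    pval = chi2_dist.sf(stat, df=len(classes) - 1)
    return stat, pval

stat_cm, p_cm = reproduce_chi2(X, y)
stat_mm, p_mm = reproduce_chi2(10 * X, y)  # every column rescaled: cm -> mm
print(stat_mm / stat_cm)  # ~[10. 10. 10. 10.]

The last line makes the scale dependence explicit: multiplying every value by 10 multiplies the statistic by exactly 10, since $(10O - 10E)^2 / (10E) = 10\,(O - E)^2 / E$.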

Best Answer

I think you are confusing the data itself (which can be continuous) with the fact that when you talk about data, you actually talk about samples, which are discrete.

The $\chi^2$ test (see the Wikipedia article, and model selection by the $\chi^2$ criterion) is a test for independence of sampled data. That is, when you have two sources of data (e.g., a feature and the class labels) and you want to check whether they are related, you test the null hypothesis that they are independent. You reject the null hypothesis (and conclude dependence) if the probability of encountering such a sample under the null hypothesis, i.e. the p-value, is smaller than some threshold value (e.g., p < 0.05). In feature selection, you keep the features for which independence is rejected.
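For concreteness, a minimal sketch of such an independence test on a toy contingency table (counts of a binary feature vs. two classes; the numbers are invented for illustration):

import numpy as np
from scipy.stats import chi2_contingency

# Rows: feature category; columns: class. Invented counts for illustration.
table = np.array([[30, 10],
                  [10, 30]])
stat, p, dof, expected = chi2_contingency(table)
print(p < 0.05)  # True: reject independence, the feature and the class are related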

So now for your questions,

  1. The $\chi^2$ test does work only on categorical data, as you must count the occurrences of the samples in each category to use it. But as I've mentioned above, when you use it you actually have samples in hand, so one thing you can do is divide your samples into categories based on thresholds (e.g., $\text{cat}_1: th_1 < x \le th_2$, $\text{cat}_2: th_2 < x \le th_3$, etc.) and count all the samples that fall into each category; see the sketch after this list.
  2. As for the scales - you obviously must use the same scale for all samples when you discretize them, otherwise it won't make any sense; but once you conduct the $\chi^2$ test itself, as you've correctly pointed out, you are dealing with counts, which have no scale anyway.
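A minimal sketch of this binning approach on the iris data from the question, using quantile bins and scipy.stats.chi2_contingency; the helper name binned_chi2_pvalue and the choice of 4 bins are mine:

import numpy as np
from scipy.stats import chi2_contingency
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

def binned_chi2_pvalue(x, y, n_bins=4):
    # Discretize the continuous feature into quantile bins.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    bins = np.digitize(x, edges[1:-1])  # bin index 0..n_bins-1 per sample
    # Build the bins-by-classes contingency table of counts.
    table = np.zeros((n_bins, len(np.unique(y))))
    for b, c in zip(bins, y):
        table[b, c] += 1
    _, p, _, _ = chi2_contingency(table)
    return p

for j in range(X.shape[1]):
    p_cm = binned_chi2_pvalue(X[:, j], y)
    p_mm = binned_chi2_pvalue(10 * X[:, j], y)  # mm instead of cm
    print(j, p_cm, p_mm)  # the two p-values agree: rescaling leaves the counts unchanged

Because the quantile edges rescale together with the data, the bin counts, and hence the p-values, do not change when switching from cm to mm; the scale dependence from the question's example disappears.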

Cheers.
