Solved – How exactly does Chi-square feature selection work

chi-squared-test, feature selection

I know that for each feature-class pair, the value of the chi-square statistic is computed and compared against a threshold.

I am a little confused though. If there are $m$ features and $k$ classes, how does one build the contingency table? How does one decide which features to keep and which ones to remove?

Any clarification will be much appreciated. Thanks in advance.

Best Answer

The chi-square test is a statistical test of independence: it checks whether two variables are related. It shares similarities with the coefficient of determination, R². However, the chi-square test applies only to categorical (nominal) data, while R² applies only to numeric data.

From the definition of chi-square, we can easily deduce how it applies to feature selection. Suppose you have a target variable (i.e., the class label) and some other features (feature variables) that describe each sample of the data. We calculate the chi-square statistic between every feature variable and the target variable, one feature at a time, and check whether a relationship exists between that feature and the target. With $m$ features and $k$ classes this gives $m$ contingency tables, each with one row per value of the feature and one column per class. If the target variable is independent of the feature variable, we can discard that feature. If they are dependent, the feature is informative and worth keeping. In practice, you either keep the features whose statistic exceeds a threshold or keep the top-scoring ones.
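To make the per-feature procedure concrete, here is a minimal sketch in base R. It uses `chisq.test` rather than FSelector, and the data frame `toy` with its columns `colour`, `shape` and `class` is made up purely for illustration. It builds one contingency table per feature and ranks the features by their chi-square statistic.

```r
# Hypothetical toy data: two categorical features and a binary class label.
toy <- data.frame(
  colour = factor(c(rep("red", 16), rep("blue", 4),    # class "pos": mostly red
                    rep("red", 4),  rep("blue", 16))), # class "neg": mostly blue
  shape  = factor(rep(c("round", "square"), times = 20)),
  class  = factor(rep(c("pos", "neg"), each = 20))
)

features <- setdiff(names(toy), "class")

# One contingency table per feature: rows = feature values, columns = classes.
chi2 <- sapply(features, function(f) {
  tab <- table(toy[[f]], toy$class)
  # correct = FALSE turns off Yates' continuity correction for 2x2 tables
  unname(chisq.test(tab, correct = FALSE)$statistic)
})

# Rank features from most to least dependent on the class.
ranked <- sort(chi2, decreasing = TRUE)
print(ranked)

# Keep the single best feature (the cut-off of 1 is an arbitrary choice here).
top <- names(ranked)[1]
print(top)
```

Here `colour` is strongly associated with `class` while `shape` is split evenly across both classes, so `colour` ranks first. With real data you would pick the cut-off (a threshold on the statistic or p-value, or a top-k count) to suit the problem.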

Mathematical details are described here: http://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html
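For quick reference, the statistic computed on each feature-versus-class contingency table is Pearson's chi-square. The notation below ($O_{ij}$, $E_{ij}$, $n_{i\cdot}$, $n_{\cdot j}$, $N$, $r$) is my own and may differ from the linked page:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{N}$$

Here $O_{ij}$ is the observed number of samples with feature value $i$ and class $j$, $E_{ij}$ is the count expected if the feature and the class were independent, $n_{i\cdot}$ and $n_{\cdot j}$ are the row and column totals, $N$ is the total number of samples, $r$ is the number of distinct feature values and $k$ is the number of classes. A larger $\chi^2$ means the observed counts deviate more from independence.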

For continuous variables, chi-square can be applied after binning the variables, i.e., discretizing them into a finite number of intervals.
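As a rough illustration of binning, the sketch below discretizes one continuous column of R's built-in `iris` data with `cut()` and then tests it against the class label. The choice of `Petal.Length` and of three equal-width bins is arbitrary and only for demonstration; other binning schemes (e.g., quantile-based) may suit your data better.

```r
# Bin a continuous feature into 3 equal-width intervals (arbitrary choice).
petal_bins <- cut(iris$Petal.Length, breaks = 3)

# Contingency table: one row per bin, one column per class.
tab <- table(petal_bins, iris$Species)
print(tab)

# Chi-square test of independence between the binned feature and the class.
res <- chisq.test(tab)
print(res$statistic)
print(res$p.value)
```

A very small p-value here indicates that the binned feature and the class are not independent, so the feature would be kept.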

An example in R, shamelessly copied from FSelector:

# Use the HouseVotes84 data from the mlbench package
library(mlbench)    # for the data
library(FSelector)  # for the method
data(HouseVotes84)

# Calculate the chi-square-based weight of each attribute
weights <- chi.squared(Class ~ ., HouseVotes84)

# Print the results
print(weights)

# Select the top five variables
subset <- cutoff.k(weights, 5)

# Print the final formula that can be used in classification
f <- as.simple.formula(subset, "Class")
print(f)

Not directly related to feature selection, but the video below discusses the chi-square test in detail: https://www.youtube.com/watch?time_continue=5&v=IrZOKSGShC8