Solved – Alternative to AIC for feature selection in classification

Tags: aic, boosting, feature selection

I want to know what the most common methods are for feature selection in classification problems (binary and multi-class).

I see in Chapter 6 of Zumel and Mount that they use AIC to screen features before training classification algorithms (trees, logistic regression, kNN) on a classification problem with both categorical and numerical features. They compute AIC as $2\,\left(\log L - \log L_{base}\right) - 2^S$ for categorical variables and $2\,\left(\log L - \log L_{base}\right) - 1$ for numeric variables ($L$ is the likelihood, $L_{base}$ is the likelihood of the saturated model, $S$ is the entropy). They keep the features whose AIC is above a certain threshold, which presumably can be tuned to improve algorithm performance.
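For concreteness, here is a minimal sketch (in Python rather than the book's R) of this kind of single-variable screening: score each candidate variable's predictions on a calibration split against a base-rate model and subtract a complexity penalty. The column names, helper functions, toy data, and the simple one-parameter-per-level penalty are my own illustrative assumptions, not the book's actual code.

```python
import numpy as np
import pandas as pd

def log_likelihood(y, p, eps=1e-12):
    """Bernoulli log-likelihood of binary outcome y under predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def aic_like_score(y, p, n_params):
    """2 * (logL - logL_base) - 2 * n_params, with logL_base taken from the base rate."""
    ll_base = log_likelihood(y, np.full_like(y, y.mean(), dtype=float))
    return 2 * (log_likelihood(y, p) - ll_base) - 2 * n_params

def score_categorical(train, calib, var, outcome):
    """Single-variable model: predict, per level, the outcome rate observed on train."""
    rates = train.groupby(var)[outcome].mean()
    p = calib[var].map(rates).fillna(train[outcome].mean()).to_numpy()
    return aic_like_score(calib[outcome].to_numpy(), p, n_params=rates.size)

# Hypothetical toy data: one informative categorical, one pure-noise categorical.
rng = np.random.default_rng(0)
d = pd.DataFrame({
    "plan": rng.choice(["a", "b", "c"], size=1000),
    "noise": rng.choice(list("wxyz"), size=1000),
})
d["churn"] = (rng.random(1000) < d["plan"].map({"a": 0.1, "b": 0.3, "c": 0.6})).astype(int)
d_train, d_calib = d.iloc[:700], d.iloc[700:]

# Keep only variables whose score clears a (tunable) threshold, e.g. 0.
for v in ["plan", "noise"]:
    print(v, round(score_categorical(d_train, d_calib, v, "churn"), 1))
```

On this toy data the informative variable scores far above zero while the noise variable does not, which is the kind of separation the threshold is meant to exploit.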

What are the alternatives, and when should I consider using them? Specifically, how should I do feature selection if I plan to train with a boosting algorithm (AdaBoost or gbm)? Are there any dangers in throwing all the noisy variables (without any feature selection) at AdaBoost or gbm, since they do not seem to be perfectly immune to overfitting?

Best Answer

James et al. (2013), An Introduction to Statistical Learning, focus strongly on cross-validation (CV) as a means of preventing overfitting. See also Hastie et al. (2009), The Elements of Statistical Learning.

Under certain circumstances, AIC and CV essentially do the same thing, but there are important cases where CV is more flexible.
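To make that concrete, here is a sketch of letting cross-validation choose the amount of filtering for the boosting model itself, which is exactly the flexibility a fixed AIC threshold lacks. The synthetic data, the mutual-information filter, the candidate values of k, and the AUC scorer are all illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data: 30 numeric features, only the first two are informative.
X = np.random.RandomState(0).normal(size=(500, 30))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Nesting the filter inside the pipeline ensures the selection is re-fit
# within each CV fold, so the score reflects the whole procedure (no leakage).
pipe = Pipeline([
    ("select", SelectKBest(score_func=mutual_info_classif)),
    ("gbm", GradientBoostingClassifier(random_state=0)),
])

# Let cross-validation choose how many features to keep for *this* model,
# instead of relying on a fixed AIC threshold; k=30 means "keep everything".
search = GridSearchCV(
    pipe,
    param_grid={"select__k": [2, 5, 10, 30]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same grid can also carry the boosting hyperparameters (e.g. "gbm__n_estimators"), so the selection strength and the model complexity are tuned jointly by the same CV loop.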

The reference above points to the free e-version of the book, so I hope you will bear with me if I do not rehash their explanations here in what could only be an inferior retelling.
