Solved – Alternative to AIC for feature selection in classification

Tags: aic, boosting, feature selection

I want to know what the most common methods are for feature selection in classification problems (binary and multi-class).

I see in Chapter 6 of Zumel and Mount that they use AIC to screen features before training classification algorithms (trees, logistic regression, kNN) on a classification problem with both categorical and numerical features. They compute AIC as $2\,\left(\log L - \log L_{base}\right) - 2^S$ for categorical variables and $2\,\left(\log L - \log L_{base}\right) - 1$ for numeric variables ($L$ is the likelihood, $L_{base}$ is the likelihood of the saturated model, $S$ is the entropy). They keep the features whose AIC is above a certain threshold, which presumably can be tuned to improve algorithm performance.
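For concreteness, here is a minimal sketch (in Python rather than the book's R) of this kind of single-variable screening: score each candidate variable's predictions on a calibration split against a base-rate model and subtract a complexity penalty. The column names, helper functions, toy data, and the simple one-parameter-per-level penalty are my own illustrative assumptions, not the book's actual code.

```python
import numpy as np
import pandas as pd

def log_likelihood(y, p, eps=1e-12):
    """Bernoulli log-likelihood of binary outcome y under predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def aic_like_score(y, p, n_params):
    """2 * (logL - logL_base) - 2 * n_params, with logL_base taken from the base rate."""
    ll_base = log_likelihood(y, np.full_like(y, y.mean(), dtype=float))
    return 2 * (log_likelihood(y, p) - ll_base) - 2 * n_params

def score_categorical(train, calib, var, outcome):
    """Single-variable model: predict, per level, the outcome rate observed on train."""
    rates = train.groupby(var)[outcome].mean()
    p = calib[var].map(rates).fillna(train[outcome].mean()).to_numpy()
    return aic_like_score(calib[outcome].to_numpy(), p, n_params=rates.size)

# Hypothetical toy data: one informative categorical, one pure-noise categorical.
rng = np.random.default_rng(0)
d = pd.DataFrame({
    "plan": rng.choice(["a", "b", "c"], size=1000),
    "noise": rng.choice(list("wxyz"), size=1000),
})
d["churn"] = (rng.random(1000) < d["plan"].map({"a": 0.1, "b": 0.3, "c": 0.6})).astype(int)
d_train, d_calib = d.iloc[:700], d.iloc[700:]

# Keep only variables whose score clears a (tunable) threshold, e.g. 0.
for v in ["plan", "noise"]:
    print(v, round(score_categorical(d_train, d_calib, v, "churn"), 1))
```

On this toy data the informative variable scores far above zero while the noise variable does not, which is the kind of separation the threshold is meant to exploit.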

What are the alternatives, and when should I consider using them? Specifically, how should I do feature selection if I plan to train with a boosting algorithm (AdaBoost or gbm)? Are there any dangers in throwing all the noisy variables (without any feature selection) at AdaBoost or gbm, since they do not seem to be perfectly immune to overfitting?

Best Answer

James et al. (2013), An Introduction to Statistical Learning, focus strongly on cross-validation (CV) as a means of preventing overfitting. See also Hastie et al. (2009), The Elements of Statistical Learning.

Under certain circumstances, AIC and CV essentially do the same thing, but there are important cases where CV is more flexible.
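To make that concrete, here is a sketch of letting cross-validation choose the amount of filtering for the boosting model itself, which is exactly the flexibility a fixed AIC threshold lacks. The synthetic data, the mutual-information filter, the candidate values of k, and the AUC scorer are all illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data: 30 numeric features, only the first two are informative.
X = np.random.RandomState(0).normal(size=(500, 30))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Nesting the filter inside the pipeline ensures the selection is re-fit
# within each CV fold, so the score reflects the whole procedure (no leakage).
pipe = Pipeline([
    ("select", SelectKBest(score_func=mutual_info_classif)),
    ("gbm", GradientBoostingClassifier(random_state=0)),
])

# Let cross-validation choose how many features to keep for *this* model,
# instead of relying on a fixed AIC threshold; k=30 means "keep everything".
search = GridSearchCV(
    pipe,
    param_grid={"select__k": [2, 5, 10, 30]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same grid can also carry the boosting hyperparameters (e.g. "gbm__n_estimators"), so the selection strength and the model complexity are tuned jointly by the same CV loop.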

The reference above points to the free e-version of the book, so I hope you will bear with me if I do not rehash their explanations here in what could only be an inferior retelling.
