Solved – Extract important features

feature selection, machine learning, statistical-learning

Here is my situation:
– A huge amount of data
– 600 features
– Only one class is provided
Now, my question is: how can I reduce the number of features to the important ones? In other words, all of these features (with data) are intended to predict only one class, but some of the features have a large impact on the prediction (meaning their variation leads to a higher probability).

Best Answer

There is no universal answer regarding feature selection. Most of the time, the right choice also depends on the learning method you use.

There are model-free methods, referred to as filters, that allow you to select predictors. For example, you can rank the predictors by their correlation with the output and retain the variables with the highest correlation (in absolute value), or with the highest mutual information, and so on. See http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf for an introduction and examples.
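As a rough illustration of such a filter, here is a minimal sketch that ranks features by the absolute value of their Pearson correlation with the target and keeps the top few. The toy data, the helper names (`pearson`, `rank_by_correlation`) and the choice of `top_k=2` are assumptions made for the example, not part of the question.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(columns, target, top_k):
    """Indices of the top_k columns with the largest |correlation| to target."""
    scores = [(abs(pearson(col, target)), j) for j, col in enumerate(columns)]
    scores.sort(reverse=True)
    return [j for _, j in scores[:top_k]]


# Hypothetical toy data: 5 features, only features 0 and 3 drive the target.
rng = random.Random(0)
n_samples = 500
columns = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(5)]
target = [2.0 * columns[0][i] - 1.5 * columns[3][i] + rng.gauss(0, 0.5)
          for i in range(n_samples)]

selected = rank_by_correlation(columns, target, top_k=2)
print(sorted(selected))
```

The same skeleton works with another score (mutual information, for instance) in place of `pearson`; only the scoring function changes.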

The "bad thing" (all feature selection methods have drawbacks) is that filters can miss interactions: two features on their own may be poor predictors, but their product may happen to be a good one. This article, http://www.public.asu.edu/~huanliu/papers/ijcai07.pdf, tackles this issue.
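To make the interaction problem concrete, here is a small sketch on made-up data where the target is (up to noise) the product of two features. Each feature alone is nearly uncorrelated with the target, so a correlation filter would discard both, while their product is an almost perfect predictor. The data-generating process and variable names are assumptions for illustration only.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: y depends on x1 and x2 only through their product.
rng = random.Random(1)
n = 2000
x1 = [rng.uniform(-1, 1) for _ in range(n)]
x2 = [rng.uniform(-1, 1) for _ in range(n)]
y = [a * b + rng.gauss(0, 0.05) for a, b in zip(x1, x2)]
product = [a * b for a, b in zip(x1, x2)]

corr_x1 = pearson(x1, y)            # close to zero
corr_x2 = pearson(x2, y)            # close to zero
corr_product = pearson(product, y)  # close to one

print(f"x1: {corr_x1:.3f}  x2: {corr_x2:.3f}  x1*x2: {corr_product:.3f}")
```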

There are model-specific methods. Random forest and gradient boosting, for example, provide variable importance estimates. After running a random forest, you can use the importances it produces to train a new random forest on the top n% most important variables (n=70 is just an example; you should cross-validate to find the best percentage to keep). These are called wrappers.
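The procedure described above can be sketched with scikit-learn roughly as follows. The synthetic dataset from `make_classification`, the list of candidate percentages and the forest settings are all assumptions chosen to keep the example small; they are not recommendations for the 600-feature problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data standing in for the real problem.
X, y = make_classification(n_samples=400, n_features=30, n_informative=5,
                           n_redundant=0, random_state=0)

# Step 1: fit a forest on all features and read its importance estimates.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)
order = np.argsort(forest.feature_importances_)[::-1]  # most important first

# Step 2: for each candidate percentage, retrain on the top features only
# and score the reduced model with 5-fold cross-validation.
results = {}
for percent in (10, 30, 50, 70, 100):
    n_keep = max(1, int(round(X.shape[1] * percent / 100)))
    kept = order[:n_keep]
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    results[percent] = cross_val_score(model, X[:, kept], y, cv=5).mean()

best_percent = max(results, key=results.get)
print(results)
print("best percentage to keep:", best_percent)
```

Note that this sketch ranks the features once on the full dataset, so the cross-validated scores are somewhat optimistic. For an honest estimate, the importance ranking would need to be recomputed inside each training fold.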

And you have embedded methods, which perform training and variable selection at the same time. As goodepic states, LASSO and Elastic net allow you to select variables in the case of a linear model. They are a little bit specific in that variable selection is performed during the training step itself.
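As a minimal sketch of an embedded method, the snippet below fits scikit-learn's `Lasso` on made-up regression data and reads the selected variables off the non-zero coefficients. The data, the ground-truth features (2, 7 and 11) and the penalty strength `alpha=0.2` are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
# Hypothetical ground truth: only features 2, 7 and 11 affect the target.
y = (3.0 * X[:, 2] - 2.0 * X[:, 7] + 1.5 * X[:, 11]
     + rng.normal(scale=0.5, size=300))

# The L1 penalty is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.2)  # alpha is an arbitrary choice for this sketch
lasso.fit(X_scaled, y)

# Features whose coefficient was shrunk exactly to zero are dropped.
selected = np.flatnonzero(lasso.coef_)
print("selected feature indices:", selected)
```

In practice the penalty strength would be tuned rather than fixed, for example by cross-validation with `LassoCV`. For a classification target, an L1-penalized logistic regression plays the same role.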

Finally, there is domain knowledge, which corresponds to your prior belief about the relevance of the predictors with respect to your target.
