Solved – Do we still need to do feature selection when using regularization algorithms?

feature selection, lasso, machine learning, regression, regularization

I have a question about the need to use feature selection methods (random forest feature importance values, univariate feature selection methods, etc.) before running a statistical learning algorithm.

We know that to avoid overfitting we can introduce a regularization penalty on the weight vector.

So if I want to do linear regression, I could introduce an L2 or L1 penalty, or even an Elastic Net penalty. Since the L1 penalty yields sparse solutions, it effectively performs feature selection.
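As a minimal illustration of that point (a sketch using scikit-learn on synthetic data; the dataset and penalty strengths below are assumptions, not from any particular study), the L1 penalty drives many coefficients exactly to zero, while the L2 penalty only shrinks them:

```python
# Sketch: L1 (Lasso) gives sparse weights, L2 (Ridge) gives dense, shrunken weights.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 100 samples, 50 features, only 5 of which are truly informative
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty -> many coefficients exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty -> coefficients shrunk, but non-zero

print("non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))
print("non-zero Ridge coefficients:", np.sum(ridge.coef_ != 0))
```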

Is it still necessary to do feature selection before running an L1-regularized regression such as the Lasso? Since the Lasso already reduces the number of features through the L1 penalty, why would feature selection be needed before running the algorithm?

I read a research article saying that ANOVA followed by an SVM gives better performance than an SVM alone. This raises a question: the SVM inherently performs regularization using the L2 norm. To maximise the margin, it minimises the norm of the weight vector, so regularization is built into its objective function. Technically, then, algorithms such as the SVM should not need feature selection methods. Yet the report still says that doing univariate feature selection before a standard SVM is more powerful.
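For what it's worth, the kind of setup the article seems to describe would look roughly like the following scikit-learn pipeline (the synthetic data and parameter values are my own assumptions): a univariate ANOVA F-test filter applied before the SVM, compared against the SVM alone under cross-validation.

```python
# Sketch of "ANOVA then SVM" versus "SVM alone", evaluated with cross-validation.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=100, n_informative=10,
                           random_state=0)

svm_alone = SVC(kernel="linear", C=1.0)
anova_svm = Pipeline([
    ("anova", SelectKBest(f_classif, k=10)),   # univariate ANOVA F-test filter
    ("svm", SVC(kernel="linear", C=1.0)),
])

print("SVM alone      :", cross_val_score(svm_alone, X, y, cv=5).mean())
print("ANOVA then SVM :", cross_val_score(anova_svm, X, y, cv=5).mean())
```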

Anyone with thoughts?

Best Answer

Feature selection sometimes improves the performance of regularized models, but in my experience it generally makes generalization performance worse. The reason for this is that the more choices we make regarding our model (including the values of the parameters, the choice of features, the setting of hyper-parameters, the choice of kernel, ...), the more data we need to make those choices reliably. Generally we make these choices by minimizing some criterion evaluated over a finite set of data, which means that the criterion inevitably has a non-zero variance. As a result, if we minimize the criterion too aggressively, we can over-fit it, i.e. we can make choices that minimize the criterion because of idiosyncrasies of the particular sample on which it is evaluated, rather than because they produce a genuine improvement in performance. I call this "over-fitting in model selection" to differentiate it from the more familiar form of over-fitting that results from tuning the model parameters.

Now the SVM is an approximate implementation of a bound on generalization performance that does not depend on the dimensionality, so in principle, we can expect good performance without feature selection, provided the regularization parameters are correctly chosen. Most feature selection methods have no such performance "guarantees".

For L1 methods, I certainly wouldn't bother with feature selection, as the L1 criterion is generally effective in trimming features on its own. It is effective because it induces an ordering in which features enter and leave the model; this reduces the number of available choices in selecting features, and hence makes the procedure less prone to over-fitting.
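To illustrate that ordering, here is a rough sketch using scikit-learn's lasso_path on synthetic data (all specifics below are assumptions): as the L1 penalty is relaxed, features enter the model one after another along the regularization path.

```python
# Sketch: the Lasso regularization path induces an ordering in which
# features enter the model as the L1 penalty (alpha) decreases.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

alphas, coefs, _ = lasso_path(X, y)   # coefs has shape (n_features, n_alphas)

# For each feature, find the strongest penalty at which its coefficient is
# non-zero, i.e. the point on the path where it first enters the model.
entry_alpha = np.array([
    alphas[np.nonzero(coefs[j])[0]].max() if np.any(coefs[j]) else 0.0
    for j in range(coefs.shape[0])
])
print("order in which features enter:", np.argsort(-entry_alpha))
```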

The best reason for feature selection is to find out which features are relevant/important. The worst reason for feature selection is to improve performance: for regularised models, it generally makes things worse. However, for some datasets it can make a big difference, so the best thing to do is to try it and use a robust, unbiased performance evaluation scheme (e.g. nested cross-validation) to find out whether yours is one of those datasets.
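For concreteness, here is a minimal sketch of such a nested cross-validation comparison, written against scikit-learn with synthetic data and illustrative parameter grids (all of the specific choices below are assumptions, not a recommendation): the inner loop tunes the hyper-parameters of each candidate, and the outer loop gives a less biased estimate of how "SVM alone" compares with "feature selection + SVM" on your data.

```python
# Sketch of nested cross-validation: inner loop tunes hyper-parameters,
# outer loop estimates generalization performance of each whole procedure.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)

candidates = {
    "SVM alone": GridSearchCV(
        SVC(kernel="linear"),
        {"C": [0.01, 0.1, 1, 10]},
        cv=3,                                  # inner loop: model selection
    ),
    "ANOVA + SVM": GridSearchCV(
        Pipeline([("anova", SelectKBest(f_classif)),
                  ("svm", SVC(kernel="linear"))]),
        {"anova__k": [5, 10, 20], "svm__C": [0.01, 0.1, 1, 10]},
        cv=3,
    ),
}

for name, model in candidates.items():
    outer = cross_val_score(model, X, y, cv=5)  # outer loop: performance estimate
    print(f"{name}: {outer.mean():.3f} +/- {outer.std():.3f}")
```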