Logistic Regression – Why Use Regularization Instead of Feature Selection?

Tags: feature-selection, logistic, loss-functions, overfitting, regularization

For a problem that is not linearly separable, we can make the data linearly separable by adding enough features. It seems to me that, for logistic regression, the cause of overfitting is always an excessive number of features. So why do people mostly use L1, and especially L2, regularization to shrink $w$ rather than feature selection? With the correct features (which can't perfectly separate the data), does $w$ also become large?

Feature selection involves many degrees of freedom in minimising the model/feature selection criterion: one binary degree of freedom for each feature. Regularisation, on the other hand, usually has only a single continuous degree of freedom. This imposes an implicit ordering in which features enter and leave the model and makes it more difficult to over-fit the feature selection criterion, which is a significant practical problem (see my answer here, but see also the paper by Ambroise and McLachlan). Feature selection is a bit of a blunt instrument: a feature is either in the model or it isn't. Regularisation is more refined.
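As a rough illustration of the "many binary degrees of freedom versus one continuous one" point, here is a minimal sketch. It makes a simplifying assumption that is not in the answer above: it uses the L1-penalised least-squares problem with an orthonormal design, where the solution is known to be soft-thresholding of the unpenalised coefficients, rather than logistic regression itself. The coefficient values in `beta_ols` and the grid `lambdas` are made up for the example.

```python
import numpy as np

# Subset selection: one binary in/out choice per feature.
d = 10
n_subsets = 2 ** d  # number of candidate feature subsets for d features

# Hypothetical unpenalised coefficients (assumed orthonormal design).
beta_ols = np.array([3.0, -1.5, 0.4, 2.2, -0.1])

def soft_threshold(beta, lam):
    """L1-penalised solution for an orthonormal design: shrink each
    coefficient towards zero by lam and clip at zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

# A single continuous knob: as lam decreases, features enter one by one.
lambdas = [3.5, 2.5, 1.0, 0.2, 0.0]
active_sets = [
    set(np.flatnonzero(soft_threshold(beta_ols, lam)).tolist())
    for lam in lambdas
]

print("candidate subsets for", d, "features:", n_subsets)
for lam, active in zip(lambdas, active_sets):
    print(f"lambda = {lam:>4}: active features = {sorted(active)}")
```

In this simplified setting the active sets are nested, so only a handful of models along one path are ever compared, instead of every possible subset. With correlated features the path need not be strictly nested, but the search is still controlled by a single parameter.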

In the appendix of his monograph on feature selection, Miller recommends that if you are primarily interested in generalisation performance (i.e. identifying the relevant features is not a primary goal of the analysis), you should not use feature selection but regularisation instead. And that is in a book on feature selection!

"With the correct features (that can't perfectly separate the data), does w also become large?"

Yes. Consider a logistic regression problem with only one feature that is perfectly separable, but where the gap between the "highest" negative pattern and the "lowest" positive pattern is very small. In this case the cross-entropy loss is minimised by pushing the output towards exactly 0 or 1 for every pattern. To do so, the input to the logistic function needs to change very quickly from very negative to very positive, which requires the single weight to be very large.

However, if the problem is not separable, but you are using a non-linear model and the data are relatively sparse, how do you tell the difference between a non-separable problem with a smooth decision boundary and a separable problem with a very wiggly decision boundary? Unfortunately the cross-entropy metric cannot make that distinction: if the model is flexible enough, the cross-entropy will be minimised by the very wiggly decision boundary. This is why regularisation of non-linear models tends to be necessary to prevent over-fitting.
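The one-feature case can be checked numerically. The following is a minimal sketch, not a reference implementation: the toy data, the helper `fit_logistic_1d`, the learning rate and the penalty strength `l2` are all assumptions chosen for illustration. It fits the weight by plain gradient descent on the mean cross-entropy, with and without an L2 penalty.

```python
import numpy as np

def fit_logistic_1d(x, y, l2=0.0, lr=0.5, n_steps=20000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on the mean
    cross-entropy plus an optional penalty 0.5 * l2 * w**2."""
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))
        grad_w = np.mean((p - y) * x) + l2 * w
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Perfectly separable toy data with a small gap around x = 0.
x = np.array([-2.0, -1.0, -0.05, 0.05, 1.0, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

w_short, _ = fit_logistic_1d(x, y, n_steps=2000)          # no penalty
w_long, _ = fit_logistic_1d(x, y, n_steps=20000)          # no penalty, more steps
w_reg, _ = fit_logistic_1d(x, y, l2=0.1, n_steps=20000)   # with L2 penalty

print("unregularised, 2000 steps: ", w_short)
print("unregularised, 20000 steps:", w_long)
print("L2-regularised:            ", w_reg)
```

Without the penalty, the weight keeps growing the longer the optimiser runs, because on separable data the loss has no finite minimiser. With the penalty it settles at a finite value.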

References:

Christophe Ambroise and Geoffrey J. McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data", PNAS, vol. 99, no. 10, pp. 6562–6566, 2002.

Alan J. Miller, "Subset Selection in Regression", second edition, Chapman & Hall/CRC Monographs on Statistics & Applied Probability, 2002.