Logistic Regression – Why Use Regularization Instead of Feature Selection?

Tags: feature-selection, logistic, loss-functions, overfitting, regularization

For a non-linearly separable problem, given enough features we can always make the data linearly separable. It seems to me that for logistic regression the cause of overfitting is always an excessive number of features. So why do people mostly use L1, and especially L2, regularization to shrink $w$ rather than feature selection? With the correct features (ones that cannot perfectly separate the data), does $w$ also become large?
Related Solutions
If you perform feature selection on all of the data and then cross-validate, then the test data in each fold of the cross-validation procedure was also used to choose the features and this is what biases the performance analysis.
Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes down as heads or tails. Next, we generate 20 features by flipping the coin 10 times for each feature and write down what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible and use that as our prediction. If we then cross-validate, we will get an expected error rate lower than 0.5. This is because we have chosen the feature on the basis of a correlation over both the training set and the test set in every fold of the cross-validation procedure. However, the true error rate is going to be 0.5, as the target data are simply random. If you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).
The key idea is that cross-validation is a way of estimating the generalization performance of a process for building a model, so you need to repeat the whole process in each fold. Otherwise, you will end up with a biased estimate, or an under-estimate of the variance of the estimate (or both).
Here is some MATLAB code that performs a Monte-Carlo simulation of this setup, with 56 features and 259 cases to match your example. The output it gives is:
Biased estimator: erate = 0.429210 (0.397683 - 0.451737)
Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)
The biased estimator is the one where feature selection is performed prior to cross-validation; the unbiased estimator is the one where feature selection is performed independently within each fold of the cross-validation. This suggests that the bias can be quite severe in this case, depending on the nature of the learning task.
NF    = 56;    % number of candidate features
NC    = 259;   % number of cases
NFOLD = 10;    % number of cross-validation folds
NMC   = 1e+4;  % number of Monte-Carlo replications

% perform Monte-Carlo simulation of the biased estimator
% (feature selection performed on all of the data before cross-validation)
erate = zeros(NMC,1);
for i=1:NMC
    % purely random binary targets and features
    y = randn(NC,1) >= 0;
    x = randn(NC,NF) >= 0;
    % perform feature selection using all of the data
    err = mean(repmat(y,1,NF) ~= x);
    [err,idx] = min(err);
    % perform cross-validation using the pre-selected feature
    partition = mod(1:NC, NFOLD)+1;
    y_xval = zeros(size(y));
    for j=1:NFOLD
        y_xval(partition==j) = x(partition==j,idx(1));
    end
    erate(i) = mean(y_xval ~= y);
    plot(erate);
    drawnow;
end
erate = sort(erate);
fprintf(1, ' Biased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
% perform Monte-Carlo simulation of the unbiased estimator
% (feature selection repeated independently within each fold)
erate = zeros(NMC,1);
for i=1:NMC
    y = randn(NC,1) >= 0;
    x = randn(NC,NF) >= 0;
    % perform cross-validation
    partition = mod(1:NC, NFOLD)+1;
    y_xval = zeros(size(y));
    for j=1:NFOLD
        % perform feature selection using only the training data for this fold
        err = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
        [err,idx] = min(err);
        y_xval(partition==j) = x(partition==j,idx(1));
    end
    erate(i) = mean(y_xval ~= y);
    plot(erate);
    drawnow;
end
erate = sort(erate);
fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
Feature selection sometimes improves the performance of regularized models, but in my experience it generally makes generalization performance worse. The reason for this is that the more choices we make regarding our model (including the values of the parameters, the choice of features, the settings of the hyper-parameters, the choice of kernel, ...), the more data we need to make these choices reliably. Generally we make these choices by minimizing some criterion evaluated over a finite set of data, which means that the criterion inevitably has a non-zero variance. As a result, if we minimize the criterion too aggressively, we can over-fit it, i.e. we can make choices that minimize the criterion because of peculiarities of the particular sample on which it is evaluated, rather than because they produce a genuine improvement in performance. I call this "over-fitting in model selection" to differentiate it from the more familiar form of over-fitting that results from tuning the model parameters.
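As a minimal sketch of why this happens, consider many candidate models that are all equally useless (true error rate 0.5), each evaluated on a finite validation sample; picking the apparent best over-fits the selection criterion, so the selected estimate looks considerably better than 0.5. The sample sizes below are arbitrary illustrative choices.

% sketch: over-fitting a model selection criterion with zero true signal
NMODELS = 100;    % number of candidate models (e.g. feature subsets)
NVAL    = 50;     % size of the validation sample used to estimate error
NREP    = 1e+4;   % Monte-Carlo replications
best    = zeros(NREP,1);
for i=1:NREP
    % estimated error rate of each candidate is Binomial(NVAL, 0.5)/NVAL
    est     = mean(rand(NVAL,NMODELS) < 0.5);
    best(i) = min(est);   % pick the candidate with the lowest estimated error
end
fprintf(1, 'true error rate = 0.5, mean selected estimate = %f\n', mean(best));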
Now the SVM is an approximate implementation of a bound on generalization performance that does not depend on the dimensionality, so in principle, we can expect good performance without feature selection, provided the regularization parameters are correctly chosen. Most feature selection methods have no such performance "guarantees".
For L1 methods, I certainly wouldn't bother with feature selection, as the L1 criterion is generally effective in trimming features. The reason that it is effective is that it induces an ordering in which features enter and leave the model, which reduces the number of available choices in selecting features, and hence is less prone to over-fitting.
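As a rough illustration of that ordering, the following sketch sweeps the L1 penalty on a synthetic logistic regression problem using plain proximal gradient descent (ISTA); the data, penalty grid, step size and iteration counts are arbitrary illustrative choices. As the penalty is relaxed, the number of non-zero weights grows a feature or so at a time, rather than by a free combinatorial search over subsets.

% sketch: the L1 regularisation path on a synthetic logistic regression problem
rng(1);                              % fix the seed (rng is available in R2011a and later)
n = 200; p = 50;                     % cases and candidate features
X = randn(n,p);
wtrue = [3; -2; 1.5; zeros(p-3,1)];  % only the first three features matter
y = sign(X*wtrue + randn(n,1));      % labels in {-1,+1}
lambdas = logspace(1,-3,20);         % penalty swept from strong to weak
eta = 1/(0.25*norm(X)^2/n);          % step size from a Lipschitz bound on the loss
nactive = zeros(size(lambdas));
w = zeros(p,1);                      % warm-start each penalty from the previous one
for k=1:numel(lambdas)
    lambda = lambdas(k);
    for it=1:2000
        margin = y.*(X*w);
        grad = -(X'*(y./(1+exp(margin))))/n;       % gradient of the mean logistic loss
        z = w - eta*grad;
        w = sign(z).*max(abs(z) - eta*lambda, 0);  % soft-thresholding (prox of the L1 penalty)
    end
    nactive(k) = sum(w ~= 0);
end
disp([lambdas' nactive']);           % number of active features at each penalty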
The best reason for feature selection is to find out which features are relevant/important. The worst reason for feature selection is to improve performance; for regularised models it generally makes things worse. However, for some datasets it can make a big difference, so the best thing to do is to try it and use a robust, unbiased performance evaluation scheme (e.g. nested cross-validation) to find out whether yours is one of those datasets.
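For reference, here is a sketch of the nested cross-validation pattern: the inner folds choose the regularisation parameter, the outer folds estimate generalisation performance, so the whole model-building process is repeated in every outer fold. The helper names (nested_cv_demo, fitridgelogistic), the synthetic data and all of the settings are illustrative assumptions; local functions require this to be saved as a function file (or as a script in MATLAB R2016b or later).

function nested_cv_demo
    rng(1);
    n = 300; p = 20;
    X = randn(n,p);
    y = sign(X(:,1) - X(:,2) + randn(n,1));   % labels in {-1,+1}
    lambdas = logspace(-3,2,10);              % candidate L2 penalties
    KOUT = 5; KIN = 5;
    outer = mod(1:n, KOUT) + 1;               % outer cross-validation partition
    err = zeros(KOUT,1);
    for i=1:KOUT
        Xtr = X(outer~=i,:); ytr = y(outer~=i);
        Xte = X(outer==i,:); yte = y(outer==i);
        % inner cross-validation, using only the outer-training data,
        % to choose the regularisation parameter
        inner = mod(1:numel(ytr), KIN) + 1;
        inerr = zeros(size(lambdas));
        for k=1:numel(lambdas)
            for j=1:KIN
                w = fitridgelogistic(Xtr(inner~=j,:), ytr(inner~=j), lambdas(k));
                inerr(k) = inerr(k) + mean(sign(Xtr(inner==j,:)*w) ~= ytr(inner==j));
            end
        end
        [~,best] = min(inerr);
        % refit on all of the outer-training data with the chosen penalty
        w = fitridgelogistic(Xtr, ytr, lambdas(best));
        err(i) = mean(sign(Xte*w) ~= yte);
    end
    fprintf(1, 'nested cross-validation error estimate = %f\n', mean(err));
end

function w = fitridgelogistic(X, y, lambda)
    % L2-regularised logistic regression fitted by plain gradient descent
    [n,p] = size(X);
    w = zeros(p,1);
    eta = 1/(0.25*norm(X)^2/n + lambda);      % step size from a Lipschitz bound
    for it=1:2000
        grad = -(X'*(y./(1+exp(y.*(X*w)))))/n + lambda*w;
        w = w - eta*grad;
    end
end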
Best Answer
Feature selection involves many degrees of freedom in minimising the model/feature selection criterion, one binary degree of freedom for each feature. Regularisation, on the other hand, usually has only a single continuous degree of freedom. This imposes an implicit ordering in which features enter and leave the model and makes it more difficult to over-fit the feature selection criterion, which is a significant practical problem (see the related answer above, and also the paper by Ambroise and McLachlan). Feature selection is a bit of a blunt instrument: a feature is either in the model or it isn't. Regularisation is more refined.
In the appendix of his monograph on feature selection, Miller recommends that if you are primarily interested in generalisation performance (i.e. identifying the relevant features is not a primary goal of the analysis), don't use feature selection; use regularisation instead. And that is in a book on feature selection!
" With the correct features (that can't perfectly sperate the data), does w also become large?"
Yes. Consider a logistic regression problem with only one feature, where the data are perfectly separable but the gap between the "highest" negative pattern and the "lowest" positive pattern is very small. In this case the cross-entropy loss is minimised by driving the output towards either 0 or 1 for all patterns. To do so, the input to the logistic function needs to change very quickly from very negative to very positive, which requires the single weight to be very large. However, if the problem is not separable, you are using a non-linear model, and the data are relatively sparse, how do you tell the difference between a non-separable problem with a smooth decision boundary and a separable problem with a very wiggly decision boundary? Unfortunately the cross-entropy metric cannot make that distinction: if the model is flexible enough, the cross-entropy will be minimised by the very wiggly decision boundary. This is why regularisation of non-linear models tends to be necessary to prevent over-fitting.
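As a minimal sketch of this effect: one feature, a tiny gap between the classes, and plain gradient descent on the logistic (cross-entropy) loss; the data, step size, penalty and iteration count are arbitrary illustrative choices. Without a penalty the weight simply keeps growing the longer we optimise, whereas a small L2 penalty keeps it at a moderate value.

% sketch: weight magnitude on separable data, with and without an L2 penalty
x = [-2; -1; -0.01; 0.01; 1; 2];   % one feature, perfectly separable, tiny margin
y = [-1; -1; -1; 1; 1; 1];         % labels in {-1,+1}
for lambda = [0 0.01]              % no penalty versus a small L2 penalty
    w = 0;
    eta = 0.5;
    for it = 1:1e+5
        grad = -mean(y.*x./(1+exp(y.*x*w))) + lambda*w;  % logistic loss (+ ridge) gradient
        w = w - eta*grad;
    end
    fprintf(1, 'lambda = %4.2f : w = %f\n', lambda, w);
end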
References:
Christophe Ambroise and Geoffrey J. McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data", PNAS, vol. 99, no. 10, pp. 6562–6566, 2002.
Miller, A. J. (2002). Subset Selection in Regression, Second Edition. Chapman & Hall/CRC Monographs on Statistics & Applied Probability.