Does feature selection help improve the performance of machine learning?
For example, suppose I have a n>>p data set, does it help to select important variable before fitting a XGBoost model?
boostingfeature selection
Does feature selection help improve the performance of machine learning?
For example, suppose I have a n>>p data set, does it help to select important variable before fitting a XGBoost model?
Whether you use LOO or K-fold CV, you'll end up with different features since the cross-validation iteration must be the most outer loop, as you said. You can think of some kind of voting scheme which would rate the n-vectors of features you got from your LOO-CV (can't remember the paper but it is worth checking the work of Harald Binder or Antoine Cornuéjols). In the absence of a new test sample, what is usually done is to re-apply the ML algorithm to the whole sample once you have found its optimal cross-validated parameters. But proceeding this way, you cannot ensure that there is no overfitting (since the sample was already used for model optimization).
Or, alternatively, you can use embedded methods which provide you with features ranking through a measure of variable importance, e.g. like in Random Forests (RF). As cross-validation is included in RFs, you don't have to worry about the $n\ll p$ case or curse of dimensionality. Here are nice papers of their applications in gene expression studies:
Since you are talking of SVM, you can look for penalized SVM.
If you perform feature selection on all of the data and then cross-validate, then the test data in each fold of the cross-validation procedure was also used to choose the features and this is what biases the performance analysis.
Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes down as heads or tails. Next, we generate 20 features by flipping the coin 10 times for each feature and write down what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible and use that as our prediction. If we then cross-validate, we will get an expected error rate slightly lower than 0.5. This is because we have chosen the feature on the basis of a correlation over both the training set and the test set in every fold of the cross-validation procedure. However, the true error rate is going to be 0.5 as the target data is simply random. If you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).
The key idea is that cross-validation is a way of estimating the generalization performance of a process for building a model, so you need to repeat the whole process in each fold. Otherwise, you will end up with a biased estimate, or an under-estimate of the variance of the estimate (or both).
HTH
Here is some MATLAB code that performs a Monte-Carlo simulation of this setup, with 56 features and 259 cases, to match your example, the output it gives is:
Biased estimator: erate = 0.429210 (0.397683 - 0.451737)
Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)
The biased estimator is the one where feature selection is performed prior to cross-validation, the unbiased estimator is the one where feature selection is performed independently in each fold of the cross-validation. This suggests that the bias can be quite severe in this case, depending on the nature of the learning task.
NF = 56;
NC = 259;
NFOLD = 10;
NMC = 1e+4;
% perform Monte-Carlo simulation of biased estimator
erate = zeros(NMC,1);
for i=1:NMC
y = randn(NC,1) >= 0;
x = randn(NC,NF) >= 0;
% perform feature selection
err = mean(repmat(y,1,NF) ~= x);
[err,idx] = min(err);
% perform cross-validation
partition = mod(1:NC, NFOLD)+1;
y_xval = zeros(size(y));
for j=1:NFOLD
y_xval(partition==j) = x(partition==j,idx(1));
end
erate(i) = mean(y_xval ~= y);
plot(erate);
drawnow;
end
erate = sort(erate);
fprintf(1, ' Biased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
% perform Monte-Carlo simulation of unbiased estimator
erate = zeros(NMC,1);
for i=1:NMC
y = randn(NC,1) >= 0;
x = randn(NC,NF) >= 0;
% perform cross-validation
partition = mod(1:NC, NFOLD)+1;
y_xval = zeros(size(y));
for j=1:NFOLD
% perform feature selection
err = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
[err,idx] = min(err);
y_xval(partition==j) = x(partition==j,idx(1));
end
erate(i) = mean(y_xval ~= y);
plot(erate);
drawnow;
end
erate = sort(erate);
fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
Best Answer
You should not have any variables that you feel would obviously not be influencing the dependent variable at all, that is have only a large pool of variables that you have a hypothesis around impacting the dependent variable; you wouldn't want your model to learn noise from variables that have no logical sense in being part of the independent variable space but have spurious correlations with other vars. But apart from those obvious exclusions, the point is, how would you know which features/variables are important and which are not? You may think a certain variable will not be of much importance but when you actually fit a model, it may come up as having much more discriminatory power than you'd thought!
In tree based ensemble methods, such as XGBoost, each variable is evaluated as a potential splitting variable, which makes them robust to unimportant/irrelevant variables, because such variables that cannot discriminate between events/non-events will not be selected as the splitting variable and hence will be very low on the var importance graph as well. However, a caveat here is that if you have two (or more) highly correlated variables, the importance that you get for these may not be indicative of their actual importance (though even this doesn't affect your model's predictive performance). So you may leave all your features in and run a few iterations to see how important/not they are and the ones that consistently lie at the bottom of the var imp chart can be excluded from subsequent runs to improve computational performance.