Solved – Nested cross-validation and feature selection: when to perform the feature selection

cross-validation, feature selection, machine learning

I am trying to predict a behavioral variable from neuroimaging data using support vector regression.
Since there are ~400,000 voxels (= features) per image and I have a limited sample size, I decided to perform a feature selection step. Specifically, I compute the univariate correlation between each feature and the dependent variable N times, each time on a sample of N-1 subjects, and I take the lowest estimate of the correlation so that I select only those features that are stably (across subjects) associated with the dependent variable.
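
For reference, a minimal sketch of this selection step, assuming the data sit in a subjects x voxels array X and a behavioral vector y (the use of absolute correlations and the fixed threshold here are just placeholders):

import numpy as np

def stable_correlation_selection(X, y, threshold=0.3):
    # X: (n_subjects, n_voxels) voxel values, y: (n_subjects,) behavioral scores.
    # For each leave-one-subject-out resample, compute every voxel's Pearson
    # correlation with y, keep the lowest absolute value seen across resamples,
    # and select the voxels whose lowest estimate still exceeds the threshold.
    n_subjects, n_voxels = X.shape
    min_abs_corr = np.full(n_voxels, np.inf)
    for i in range(n_subjects):
        keep = np.arange(n_subjects) != i          # drop subject i
        Xi, yi = X[keep], y[keep]
        Xc = Xi - Xi.mean(axis=0)                  # center features
        yc = yi - yi.mean()                        # center target
        r = (Xc * yc[:, None]).sum(axis=0) / (
            np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)
        min_abs_corr = np.minimum(min_abs_corr, np.abs(r))
    return np.flatnonzero(min_abs_corr >= threshold)   # indices of "stable" voxels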

In order to select the hyperparameters of the SVR (ν and C) I am performing a nested cross-validation.

Right now, the whole process looks like this:

For every subject in N
    Take the N-1 remaining subjects
    Perform feature selection on N-1
    For every combination of hyperparameters
        For every subject in N-1
            Fit the model on N-1-1
            Test the model on the inner left-out subject
    Choose the best combination of hyperparameters
    Fit a model on N-1 using the best combination
    Test the model on the outer left-out subject
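
In code, the scheme above corresponds roughly to the following sketch (using scikit-learn's NuSVR and leave-one-out splits purely for illustration; stable_correlation_selection is the selection step sketched earlier, and the hyperparameter grid is only an example):

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import NuSVR

param_grid = [{"nu": nu, "C": C} for nu in (0.25, 0.5, 0.75) for C in (0.1, 1, 10)]
outer_errors = []

for train, test in LeaveOneOut().split(X):                       # outer loop over subjects
    selected = stable_correlation_selection(X[train], y[train])  # done once, before the inner loop
    Xtr, Xte = X[train][:, selected], X[test][:, selected]
    ytr = y[train]

    inner_errors = []                                            # inner loop: tune (nu, C)
    for params in param_grid:
        errors = []
        for itr, ite in LeaveOneOut().split(Xtr):
            model = NuSVR(**params).fit(Xtr[itr], ytr[itr])
            errors.append((model.predict(Xtr[ite])[0] - ytr[ite][0]) ** 2)
        inner_errors.append(np.mean(errors))
    best = param_grid[int(np.argmin(inner_errors))]

    final = NuSVR(**best).fit(Xtr, ytr)                          # refit on all N-1 subjects
    outer_errors.append((final.predict(Xte)[0] - y[test][0]) ** 2)

print("outer-loop MSE:", np.mean(outer_errors))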

What I am wondering about now is the feature selection. Is it correct to perform it only once, before the inner loop that cross-validates the hyperparameters, or should it be performed within the inner loop, together with the choice of the hyperparameters?
On the one hand, the feature selection is indeed independent of the outer test sample; on the other hand, it is not cross-validated with respect to the inner samples.

Any take on this?

Best Answer

Usually, performing feature selection inside the inner loop would be the safer option.

Consider whether your feature selection itself has tunable parameters, such as the amount of correlation you allow, the amount of information you preserve, or similar. If you want to optimize those but do not do so in the inner loop, you will likely end up with an overly optimistic error estimate (as you no longer have a separate inner-loop performance estimate). Therefore, tuning such things in the inner loop and using the outer loop for the final error estimation is usually the way to go.
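
One way to set this up in practice is to wrap the feature selection and the SVR in a single pipeline, so that both are refit inside every inner fold. A sketch only, using scikit-learn's built-in univariate filter as a stand-in for your custom correlation filter; the grids and the leave-one-out splits are placeholders:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.svm import NuSVR
from sklearn.model_selection import GridSearchCV, LeaveOneOut, cross_val_score

# The filter and the SVR are refit on every training split, so the
# selection never sees the corresponding left-out subject.
pipe = Pipeline([("select", SelectKBest(score_func=f_regression)),
                 ("svr", NuSVR())])

# The feature selection parameter k is tuned alongside nu and C in the inner loop.
grid = {"select__k": [100, 500, 1000],
        "svr__nu": [0.25, 0.5, 0.75],
        "svr__C": [0.1, 1, 10]}

inner = GridSearchCV(pipe, grid, cv=LeaveOneOut(),
                     scoring="neg_mean_squared_error")

# The outer loop then gives the (nearly) unbiased performance estimate.
outer_scores = cross_val_score(inner, X, y, cv=LeaveOneOut(),
                               scoring="neg_mean_squared_error")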

Update: I tried to sketch a workflow that should be applicable to your problem, in as few steps as possible (see below; I hope I didn't mess anything up). If you want more details, consider reading one of these papers:

Varma & Simon (2006). "Bias in error estimation when using cross-validation for model selection." BMC Bioinformatics, 7: 91

Cawley & Talbot (2010). "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation." Journal of Machine Learning Research, 11: 2079-2107

Do data partitioning (train subjects / test subjects)
Do e.g. repeated CV (leave-one-subject-out CV) on the train data: for every train subject N1:
    Leave out N1
        For every remaining subject N2: leave out N2
            Fit all combinations of feature selection parametrization, hyperparameters, etc. on N-N1-N2
            Evaluate and remember the performance of each combination on the inner left-out subject N2
            Evaluate and remember the performance of each combination on the outer left-out subject N1
Select the "best" parametrization from the performance on the left-out N2 subjects
Report the CV model performance from the performance on the left-out N1 subjects
Train the final model on all training data using the chosen "best" parametrization
Test the final model on the held-out test subjects - double check that the model does what it should
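
In scikit-learn terms, and reusing the nested grid-search estimator `inner` from the sketch above (the split ratio and grids remain placeholders), the workflow could look roughly like this:

from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split

# 1. Data partitioning: train subjects vs. held-out test subjects
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# 2. Nested CV on the train subjects: `inner` tunes feature selection and
#    (nu, C); the outer loop provides the reported CV performance estimate
cv_mse = -cross_val_score(inner, X_train, y_train, cv=LeaveOneOut(),
                          scoring="neg_mean_squared_error").mean()

# 3.-5. Select the "best" parametrization and train the final model on all
#       training data (GridSearchCV refits on the full training set itself)
inner.fit(X_train, y_train)
print("chosen parametrization:", inner.best_params_)

# 6. Double check the final model on the untouched test subjects
holdout_mse = ((inner.predict(X_test) - y_test) ** 2).mean()
print("CV MSE estimate:", cv_mse, " hold-out MSE:", holdout_mse)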