Solved – Having trouble understanding cross-validation results from scikit-learn

machine-learning, python, scikit-learn, svm

Actually, my question may just be about cross-validation in general. Here's what I'm doing: I'm trying to come up with a model using scikit-learn to learn on some data I've got. I've decided to use an SVM, using various kernels, to do the modelling. I've got about 50,000 data points from which to extract features. In an effort to make sure that my model is not over- or under-fitting, I've decided to run all of my models through cross-validation using scikit-learn's cross_validation functionality. I'm setting aside 40% of my training data for cross-validation, and so training on 60%.
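In case it helps, the splitting and scoring part of my workflow looks roughly like this. It's a simplified sketch with placeholder data (my real feature extraction is omitted), and I'm writing it against the current model_selection API rather than the older cross_validation module I actually used:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data standing in for my ~50,000 extracted feature vectors
X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)

# Set aside 40% of the training data for validation, train on the remaining 60%
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

model = SVC(kernel="rbf").fit(X_train, y_train)
print(model.score(X_val, y_val))  # this is the "cross-validation score" I'm quoting
```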

I do this iteratively until I come up with a set of features and a model that gives me a cross-validation score of about 0.96. Great! Here's the problem: when I use this model to predict results for my test data, I only get a score of about 0.79! I don't understand that result. My question is, am I misunderstanding the cross-validation score? Shouldn't I expect similar results on my test data from a model that cross-validates at 0.96? I even used GridSearchCV to find the best parameters for the SVM kernel. I also made sure to retrain my model on the full set of training data before running predict.
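The parameter search and the final fit look roughly like this (again simplified, and the grid values are just illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Same placeholder data as in the sketch above
X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

# Tune C and gamma on the 60% training portion
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# Retrain on the full training set with the chosen parameters before predicting on the test data
final_model = SVC(kernel="rbf", **search.best_params_).fit(X, y)
```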

This is my first real attempt to use machine learning for a cool project, and I'm totally confused about what I should be expecting here.

Best Answer

From Section 7.10.2 of The Elements of Statistical Learning (free online, and it's great):

Consider a classification problem with a large number of predictors, as may arise, for example, in genomic or proteomic applications. A typical strategy for analysis might be as follows:

  1. Screen the predictors: find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels
  2. Using just this subset of predictors, build a multivariate classifier.
  3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.

Is this a correct application of cross-validation? Consider a scenario with N = 50 samples in two equal-sized classes, and p = 5000 quantitative predictors (standard Gaussian) that are independent of the class labels. The true (test) error rate of any classifier is 50%. We carried out the above recipe, choosing in step (1) the 100 predictors having highest correlation with the class labels, and then using a 1-nearest neighbor classifier, based on just these 100 predictors, in step (2). Over 50 simulations from this setting, the average CV error rate was 3%. This is far lower than the true error rate of 50%.

What has happened? The problem is that the predictors have an unfair advantage, as they were chosen in step (1) on the basis of all of the samples. Leaving samples out after the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set, since these predictors “have already seen” the left out samples.

We selected the 100 predictors having largest correlation with the class labels over all 50 samples. Then we chose a random set of 10 samples, as we would do in five-fold cross-validation, and computed the correlations of the pre-selected 100 predictors with the class labels over just these 10 samples. We see that the correlations average about 0.28, rather than 0, as one might expect.
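To make the book's point concrete, here is a small sketch of my own (not from the book) that reproduces the effect. It uses SelectKBest with an ANOVA F-score as a stand-in for the book's correlation screening: selecting the 100 "best" of 5,000 pure-noise predictors on the full data set before cross-validating gives a wildly optimistic CV score, while putting the selection inside a Pipeline, so it is refit on each training fold, gives an estimate close to the true 50% error rate:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(50, 5000)            # 50 samples, 5,000 pure-noise predictors
y = np.array([0] * 25 + [1] * 25)  # two equal-sized classes, independent of X

# Wrong: screen the 100 "best" predictors using ALL samples, then cross-validate
X_screened = SelectKBest(f_classif, k=100).fit_transform(X, y)
wrong = cross_val_score(KNeighborsClassifier(n_neighbors=1), X_screened, y, cv=5).mean()

# Right: put the screening inside the pipeline so it only ever sees each training fold
pipe = make_pipeline(SelectKBest(f_classif, k=100), KNeighborsClassifier(n_neighbors=1))
right = cross_val_score(pipe, X, y, cv=5).mean()

print("leaky selection CV accuracy:    %.2f" % wrong)  # far above 0.5
print("pipeline selection CV accuracy: %.2f" % right)  # close to 0.5, the true error rate
```

The general lesson in scikit-learn terms is to put every data-dependent step (feature selection, scaling, and so on) inside a Pipeline and pass that pipeline to cross_val_score or GridSearchCV, so that nothing computed from a held-out fold leaks into the model being evaluated on it.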