Solved – Having trouble understanding cross-validation results from scikit-learn

machine-learning, python, scikit-learn, svm

Actually, my question may just be about cross-validation in general. Here's what I'm doing: I'm trying to come up with a model using scikit-learn to learn on some data I've got. I've decided to use an SVM, using various kernels, to do the modelling. I've got about 50,000 data points from which to extract features. In an effort to make sure that my model is not over- or under-fitting, I've decided to run all of my models through cross-validation using scikit-learn's cross_validation functionality. I'm setting aside 40% of my training data for cross-validation, and so training on 60%.
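In case it helps, the splitting and scoring part of my workflow looks roughly like this. It's a simplified sketch with placeholder data (my real feature extraction is omitted), and I'm writing it against the current model_selection API rather than the older cross_validation module I actually used:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data standing in for my ~50,000 extracted feature vectors
X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)

# Set aside 40% of the training data for validation, train on the remaining 60%
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

model = SVC(kernel="rbf").fit(X_train, y_train)
print(model.score(X_val, y_val))  # this is the "cross-validation score" I'm quoting
```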

I do this iteratively until I come up with a set of features and a model that gives me a cross-validation score of about 0.96. Great! Here's the problem: when I use this model to predict results for my test data, I only get a score of about 0.79! I don't understand that result. My question is, am I misunderstanding the cross-validation score? Shouldn't I expect similar results on my test data from a model that cross-validates at 0.96? I even used GridSearchCV to find the best parameters for the SVM kernel. I also made sure to retrain my model on the full set of training data before running predict.
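The parameter search and the final fit look roughly like this (again simplified, and the grid values are just illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Same placeholder data as in the sketch above
X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

# Tune C and gamma on the 60% training portion
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# Retrain on the full training set with the chosen parameters before predicting on the test data
final_model = SVC(kernel="rbf", **search.best_params_).fit(X, y)
```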

This is my first real attempt to use machine learning for a cool project, and I'm totally confused about what I should be expecting here.

Best Answer

From Section 7.10.2 of The Elements of Statistical Learning (free online, and it's great):

Consider a classification problem with a large number of predictors, as may arise, for example, in genomic or proteomic applications. A typical strategy for analysis might be as follows:

  1. Screen the predictors: find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels
  2. Using just this subset of predictors, build a multivariate classifier.
  3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.

Is this a correct application of cross-validation? Consider a scenario with N = 50 samples in two equal-sized classes, and p = 5000 quantitative predictors (standard Gaussian) that are independent of the class labels. The true (test) error rate of any classifier is 50%. We carried out the above recipe, choosing in step (1) the 100 predictors having highest correlation with the class labels, and then using a 1-nearest neighbor classifier, based on just these 100 predictors, in step (2). Over 50 simulations from this setting, the average CV error rate was 3%. This is far lower than the true error rate of 50%.

What has happened? The problem is that the predictors have an unfair advantage, as they were chosen in step (1) on the basis of all of the samples. Leaving samples out after the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set, since these predictors “have already seen” the left out samples.

We selected the 100 predictors having largest correlation with the class labels over all 50 samples. Then we chose a random set of 10 samples, as we would do in five-fold cross-validation, and computed the correlations of the pre-selected 100 predictors with the class labels over just these 10 samples. We see that the correlations average about 0.28, rather than 0, as one might expect.
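To make the book's point concrete, here is a small sketch of my own (not from the book) that reproduces the effect. It uses SelectKBest with an ANOVA F-score as a stand-in for the book's correlation screening: selecting the 100 "best" of 5,000 pure-noise predictors on the full data set before cross-validating gives a wildly optimistic CV score, while putting the selection inside a Pipeline, so it is refit on each training fold, gives an estimate close to the true 50% error rate:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(50, 5000)            # 50 samples, 5,000 pure-noise predictors
y = np.array([0] * 25 + [1] * 25)  # two equal-sized classes, independent of X

# Wrong: screen the 100 "best" predictors using ALL samples, then cross-validate
X_screened = SelectKBest(f_classif, k=100).fit_transform(X, y)
wrong = cross_val_score(KNeighborsClassifier(n_neighbors=1), X_screened, y, cv=5).mean()

# Right: put the screening inside the pipeline so it only ever sees each training fold
pipe = make_pipeline(SelectKBest(f_classif, k=100), KNeighborsClassifier(n_neighbors=1))
right = cross_val_score(pipe, X, y, cv=5).mean()

print("leaky selection CV accuracy:    %.2f" % wrong)  # far above 0.5
print("pipeline selection CV accuracy: %.2f" % right)  # close to 0.5, the true error rate
```

The general lesson in scikit-learn terms is to put every data-dependent step (feature selection, scaling, and so on) inside a Pipeline and pass that pipeline to cross_val_score or GridSearchCV, so that nothing computed from a held-out fold leaks into the model being evaluated on it.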