Solved – How to interpret stable and overlapping learning curves

cross-validation, machine learning, python, scikit-learn, svm

I have a training data size of about 80k.

I plotted a learning curve to check how much of the training sample is required to train the model. However, the resulting learning curve looks like this:

[learning curve plot: training and cross-validation curves overlap at a high score]


From How to know if a learning curve from SVM model suffers from bias or variance?, I learned two main points:

If the two curves are "close to each other" and both have a low score, the model suffers from an underfitting problem (high bias).

  • But both curves have a high accuracy, so I am guessing it is not underfitting.

If the training curve has a much better score than the testing curve, i.e., there is a large gap between the two curves, the model suffers from an overfitting problem (high variance).

  • It does not seem like a problem of overfitting either.

1) What can I infer from this graph? Is it normal for the curves to overlap each other?

2) What should I understand from this particular graph?

Edit: As suggested, I have run the iterations over training-set sizes from 0 up to len(data). The lines still overlap.

[learning curve over the full range of training sizes: the lines still overlap]
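For reference, such a sweep can be requested through the train_sizes argument of learning_curve; the exact grid here is only illustrative:

    import numpy as np

    # fractions of the training set, from ~1% up to the full size
    train_sizes = np.linspace(0.01, 1.0, 10)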

I failed to mention that the data is highly skewed, with an 80-20 class imbalance. So I am guessing the model just predicts everything to be the majority class, and that is why the scores are high. I am not sure. Any suggestions?
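A quick way to test that guess is to compare against a majority-class baseline; a minimal sketch, assuming X and y are already loaded (the split parameters here are only illustrative):

    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import train_test_split

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    # always predicts the most frequent class; if the SVC only matches this
    # score, it is effectively ignoring the minority class
    baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
    print("majority-class accuracy:", baseline.score(X_te, y_te))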



@steffan: I have uploaded the training vector X at Train Vector X and the respective target vector y at Train target y, as pickle files.

The code I have used is from the scikit-learn example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve, ShuffleSplit

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.01, 1.0, 5)):

    plt.figure(figsize=(13, 9))
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

title = "Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"
# SVC is more expensive so we do a lower number of CV iterations:
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = SVC(kernel = 'rbf', C=10000, gamma=0.001, class_weight='balanced')
plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)
# X and y are the training vector and the target
plt.show()

The code is from here: scikit-learn example

I am not sure which score is being used in that code; sorry for my limited understanding here.

Best Answer

It is no surprise: the learning curve depends heavily on the capabilities of the learner, on the structure of the data set, and on the predictive power of its features.

It might be the case that there is only little variance in the combination of feature values (predictors) and labels (response). In this case even a small sample size can allow a capable learner to find all detectable patterns, resulting in an early high score. If not all patterns can be detected, no perfect score can be achieved.

Since the training score is slightly above the CV score at 10000 samples, I'd expect an even greater difference below 10000 samples, so I suggest testing that. If the overlap and the high score remain even for a small number of examples (say << 1000), you should double-check for an error (e.g. accidental row duplication, or an error in the CV calculation if you have done it yourself).
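A minimal sketch of the duplication check, assuming X is a 2-D NumPy array:

    import numpy as np

    # duplicated rows can inflate both the training and the CV score,
    # since identical samples may land on both sides of a split
    n_unique = len(np.unique(X, axis=0))
    print(f"{len(X) - n_unique} duplicate rows out of {len(X)}")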

Edit regarding class imbalance

The class distribution (e.g. the output of pd.Series(y).value_counts()) is

y
0    77623
1     5436

so always guessing the majority class yields an accuracy of 77623 / (77623 + 5436) ≈ 0.935, which is exactly what we see in the learning curve. The scoring function used by learning_curve defaults to the one provided by the estimator, which for SVC is accuracy (see the SVC documentation). The scoring function can be changed via the scoring parameter of learning_curve (see the learning_curve documentation); the parameter also needs to be passed through plot_learning_curve.
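A sketch of that change, assuming X and y are already loaded ('f1' is just one imbalance-aware choice; 'balanced_accuracy' would also work):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import learning_curve, ShuffleSplit

    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    estimator = SVC(kernel='rbf', C=10000, gamma=0.001, class_weight='balanced')

    # scoring='f1' replaces SVC's default accuracy with an imbalance-aware metric;
    # plot_learning_curve would need a matching parameter to forward here
    sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=4,
        train_sizes=np.linspace(.01, .15, 5), scoring='f1')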

Rerunning the code, but not going up to the full training size (due to the time complexity of SVC):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
import pickle

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.01, 1.0, 5)):
    # calc curve first to avoid premature opening of figure
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # now create the plot
    plt.figure(figsize=(13, 9))
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")

    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt



with open('path-to-test', 'rb') as f:
    y = np.array(pickle.load(f))

# https://stackoverflow.com/questions/11305790/pickle-incompatibility-of-numpy-arrays-between-python-2-and-3
X = None
with open('path-to-train', 'rb') as f:
    u = pickle._Unpickler(f)
    u.encoding = 'latin1'
    X = np.array(u.load())

gamma = 0.001
C = 10000
title = "Learning Curves (SVM, RBF kernel, $\gamma=" + str(gamma) + ", C=" + str(C) + "$)"
# SVC is more expensive so we do a lower number of CV iterations:
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
estimator = SVC(kernel = 'rbf', C=C, gamma=gamma, class_weight='balanced')
plot_learning_curve(estimator, title, X, y, cv=cv, n_jobs=6, train_sizes=np.linspace(0.015,0.15,5))
plt.show()

leads to this graph

[learning curve from the rerun over smaller training sizes]

so there might be a code issue, perhaps in the initial loading or preprocessing of X and y.
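A minimal sanity check right after loading, as a sketch (assuming y holds integer 0/1 labels):

    import numpy as np

    print(X.shape, y.shape)   # the row counts must match
    print(np.bincount(y))     # expect [77623  5436], as counted above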