Solved – Learning Curve in scikit-learn

machine learning, python, scikit-learn

Why is the learning curve from scikit-learn (example here) different from the one
taught by Andrew Ng in his Machine Learning course on Coursera?

According to Andrew Ng, the training error should start near zero and increase with training set size as the model generalizes over more data, while the validation error decreases with training set size. But scikit-learn gives me the opposite plot: the training curve decreases and the validation curve increases.
Could someone explain?

Best Answer

It depends on what is shown on the y-axis: score or error.

The scikit-learn example you linked to shows the score (mean accuracy in the case of Naive Bayes and SVM) as a function of the number of training examples. As you would expect, as the number of training examples grows, training accuracy decreases and CV accuracy increases.

If you were to plot the error on the y-axis instead, you would get the opposite behaviour: training error increases and CV error decreases as the number of training examples grows.
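To see this concretely, here is a minimal sketch using `sklearn.model_selection.learning_curve` on a synthetic dataset (the dataset and the choice of `GaussianNB` are illustrative assumptions, not taken from the linked example). Since the curves report accuracy, converting to error is just `1 - score`, which mirrors the plot vertically:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

# Synthetic classification problem (illustrative assumption)
X, y = make_classification(n_samples=2000, random_state=0)

# Scores at 5 increasing training-set sizes, with 5-fold CV
train_sizes, train_scores, cv_scores = learning_curve(
    GaussianNB(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
)

# Average over the CV folds
train_score = train_scores.mean(axis=1)  # what scikit-learn's example plots
cv_score = cv_scores.mean(axis=1)

# Error view: the same data, flipped vertically
train_error = 1.0 - train_score  # increases where the score decreases
cv_error = 1.0 - cv_score        # decreases where the score increases
```

Plotting `train_error` and `cv_error` against `train_sizes` reproduces the Ng-style curves from exactly the same numbers that produced the scikit-learn-style score curves.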

I suspect the learning curve used in the ML course shows the error (MSE, if I remember correctly) as a function of the number of training examples. Hence the different plots.