Solved – Plotting learning curves for any classification algorithm

machine learning, random forest, scikit-learn, svm

As recommended by Andrew Ng in his great course on machine learning, I would like to plot the learning curves for experiments I am running with Random Forest and SVM algorithms.

The learning curves plot the cost minimized during training against the number of training samples, for both the training and the test sets, and make it possible to detect high-variance or high-bias problems.

I'm using scikit-learn and I'm aware of sklearn.learning_curve.learning_curve (sklearn.model_selection.learning_curve in current releases), but it computes classification scores for different training set sizes, and I'm wondering whether that is the same as using the cost.

Is using the classification score the correct way to plot the learning curve for a classification process in order to diagnose high variance or high bias? Or is there a cost function I could use instead?

Best Answer

In fact, you can define your own error function and pass it to the validation_curve() function through its scoring argument, like so (learning_curve() accepts the same scoring argument):

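import numpy as np
from sklearn.model_selection import validation_curve

def rms_error(model, X, y):
    # scikit-learn calls a custom scorer as scorer(estimator, X, y) and expects a float
    y_pred = model.predict(X)
    return np.sqrt(np.mean((y - y_pred) ** 2))

# PolynomialRegression(), X, y and degree (the degrees to try) are assumed to be defined elsewhere
val_train, val_test = validation_curve(PolynomialRegression(), X, y,
                                       param_name='polynomialfeatures__degree',
                                       param_range=degree, cv=7, scoring=rms_error)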
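
The same idea carries over to learning curves for a classifier. What follows is only a minimal sketch, not a definitive recipe: it assumes log loss (cross-entropy) is an acceptable cost for your problem, the helper name log_loss_cost is made up for illustration, and the data is synthetic so that the snippet runs on its own. It passes a custom cost to learning_curve() and prints the mean training and validation cost for each training set size.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import learning_curve


def log_loss_cost(model, X, y):
    # Hypothetical helper: same (estimator, X, y) -> float signature as rms_error.
    # Returns the cross-entropy of the predicted class probabilities (lower is better).
    return log_loss(y, model.predict_proba(X), labels=model.classes_)


# Synthetic data, used only so the sketch is self-contained.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

train_sizes, train_cost, test_cost = learning_curve(
    RandomForestClassifier(n_estimators=25, random_state=0),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),  # fractions of the available training data
    cv=5,
    scoring=log_loss_cost,
)

# Each row holds one training set size, each column one cross-validation fold.
mean_train_cost = train_cost.mean(axis=1)
mean_test_cost = test_cost.mean(axis=1)

for n, tr, te in zip(train_sizes, mean_train_cost, mean_test_cost):
    print(f"{int(n):4d} samples  train cost {tr:.3f}  validation cost {te:.3f}")
```

Plotting mean_train_cost and mean_test_cost against train_sizes gives the curves from the question. A large gap that persists between the two suggests high variance, while two curves that converge to a high cost suggest high bias. An SVM would need probability=True for predict_proba to be available, or a different cost such as hinge loss on decision_function output.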