Solved – 100% training accuracy despite a low CV score

Tags: auc, cross-validation, radial-basis, roc, svm

I am working on an assignment where we have to study the effect of the gamma and C parameters on an SVM with an RBF kernel. I use Python's sklearn library and grid search with 10-fold cross-validation (with a test size of 0.2) to test different values of gamma. For each value of gamma I also compute the mean validation score (the mean of the 10 fold accuracy scores) and the accuracy and AUC of the ROC curve on the training data. For one particular value I get a poor mean validation score of 0.5, but 100% accuracy and an AUC of 1. I am not able to interpret these values. I know that 100% accuracy on the training data indicates overfitting, but I cannot explain/reconcile such a huge difference between the AUC and the CV score. Is the overfitting so extreme that it drives the CV score down to chance level?
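For reference, here is a minimal sketch that reproduces the gap; make_blobs is only a stand-in here, since my actual create_data helper is not shown:

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import cross_val_score  # sklearn >= 0.18; older versions: sklearn.cross_validation
    from sklearn.svm import SVC

    # Stand-in data: two Gaussian blobs with balanced classes
    X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=0)

    # With an extreme gamma the RBF kernel memorizes the training points:
    # each point is only "similar" to itself, so the training set is fit perfectly,
    # while any held-out point looks unlike everything seen during training
    clf = SVC(kernel="rbf", gamma=2**30, C=1.0)
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))                          # ~1.0
    print("mean CV accuracy:", cross_val_score(clf, X, y, cv=10).mean())  # ~0.5 (chance)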

The different classifiers are shown below. Classifier #5 is the one with a CV score of 0.5 and an accuracy of 1. I am not sure how to interpret its plot: it shows the entire region colored red, yet the accuracy is still 1. Shouldn't the accuracy be around 0.5, since half of the samples (the blue ones) appear to be misclassified?

[Graph (5): AUC = 1, CV score = 0.5]

The relevant code is:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV   # old sklearn; in >= 0.18 use sklearn.model_selection
from sklearn.metrics import accuracy_score, roc_auc_score

def run():
    # create_data is a helper defined elsewhere; returns 1000 samples with 2 features
    X, y = create_data(1000, 2)

    # gamma values: 2^-15 .. 2^15 (4 points), plus the extremes 2^30 and 2^-30
    gamma_range = np.logspace(start=-15, stop=15, num=4, base=2)
    gamma_range = np.append(gamma_range, [2**30, 2**(-30)])
    gamma_grid = dict(gamma=gamma_range)
    print(gamma_grid)

    C_range = np.logspace(start=-15, stop=15, num=4, base=2)
    #C_range = np.append(C_range, [2**20, 2**-20])
    C_grid = dict(C=C_range)
    print(C_grid)

    # mesh grid for plotting the decision function
    plt.figure(figsize=(10, 7))
    h = .2
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # grid search over gamma only; C stays at its default of 1.0
    grid = GridSearchCV(SVC(kernel="rbf"), param_grid=gamma_grid, cv=10)
    grid.fit(X, y)
    grid_scores = grid.grid_scores_   # pre-0.20 attribute; newer versions use cv_results_
    print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
    print(grid_scores)

    for i, score in enumerate(grid_scores):
        # refit on the FULL data set for each gamma, again with the default C
        gamma = score.parameters['gamma']
        clf = SVC(kernel="rbf", gamma=gamma)
        clf.fit(X, y)
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        y_predicted = clf.predict(X)   # predictions on the training data itself
        print(score.cv_validation_scores)
        print("(%d) γ=2^%d, C=%s, CV-Score = %.3f, accuracy=%.2f, AUC = %.3f"
              % (i + 1, np.log2(gamma), "Default", score.mean_validation_score,
                 accuracy_score(y, y_predicted), roc_auc_score(y, y_predicted)))

        # visualize the decision function for these parameters
        plt.subplot(3, 4, i + 1)
        plt.title("(%d) γ=2^%d C=%s CV-Score = %.3f AUC = %.3f"
                  % (i + 1, np.log2(gamma), "Default", score.mean_validation_score,
                     roc_auc_score(y, y_predicted)), size='medium')

        # visualize the parameter's effect on the decision function
        plt.pcolormesh(xx, yy, -Z, cmap=plt.cm.RdBu)
        plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdBu_r)
        plt.xticks(())
        plt.yticks(())
        plt.axis('tight')

Best Answer

Reviewing your code, there are a couple of things you might consider trying.

  • you are not setting the C values, so sklearn uses its default of C = 1. This does not necessarily mean you are overfitting (it depends on your data), but it can contribute to it

(You create a dictionary for C_grid, but you never pass it to GridSearchCV; you should combine the two dictionaries C_grid and gamma_grid into one and pass that as param_grid, as sketched after this list)

  • in the loop, when refitting on the full data, you again use the default value of C. Also keep in mind that you compute the AUC on the full training data, not on held-out data, so it does not reflect the CV selection at all; this is again where overfitting can hide
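For the first point, a minimal sketch of the combined grid, reusing the names from your code:

    # One dict with both parameter ranges; GridSearchCV searches the cross product
    param_grid = dict(gamma=gamma_range, C=C_range)
    grid = GridSearchCV(SVC(kernel="rbf"), param_grid=param_grid, cv=10)
    grid.fit(X, y)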

In conclusion, I would suggest (1) not relying on the default C but searching over a grid of C values, and (2) when looping over the different gammas to calculate the AUC, using the best C value found by the grid search (sketched below).
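For the second point, something along these lines inside your loop (again just a sketch built on the names from your code):

    best_C = grid.best_params_['C']   # the C selected by the grid search above
    for i, score in enumerate(grid_scores):
        gamma = score.parameters['gamma']
        # refit with the selected C instead of the default C = 1
        clf = SVC(kernel="rbf", gamma=gamma, C=best_C)
        clf.fit(X, y)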

Hope it helps.