I am working on an assignment where we have to study the effect of the gamma and C parameters on an SVM with an RBF kernel. I use Python's sklearn library and a grid search with 10-fold cross-validation (with a test size of .2) to test different values of gamma. For each value of gamma I also compute the mean validation score (the mean of the 10-fold CV accuracy scores) as well as the accuracy and the AUC of the ROC curve on the training data. For one particular value I get a poor mean validation score of .5 but 100% accuracy and an AUC of 1. I am not able to interpret these values. I know 100% accuracy on the training data indicates over-fitting, but I cannot explain/reconcile such a huge difference between the AUC and the CV score. Is the over-fitting so extreme that it causes the CV score to be really poor?
The different classifiers are shown below. Classifier #5 is the one with a CV score of .5 and an accuracy of 1. I am not sure how to interpret its plot: the entire region is colored red, yet the accuracy is still 1. Shouldn't the accuracy be around .5, since half the samples (blue) are misclassified?
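A minimal standalone sketch of this behaviour (using make_blobs as a stand-in for my create_data, which is not shown here): with an extreme gamma the RBF kernel makes each training point its own tiny bump, so the model memorizes the training set while held-out points fall outside every bump.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic two-class data (stand-in for create_data(1000, 2))
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

clf = SVC(kernel="rbf", gamma=2**30)   # absurdly large gamma
clf.fit(X, y)

train_acc = clf.score(X, y)                       # near-perfect: points memorized
cv_acc = cross_val_score(clf, X, y, cv=10).mean() # near chance on held-out folds

print(train_acc, cv_acc)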
The relevant code is:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score

def run():
    X, y = create_data(1000, 2)

    gamma_range = np.logspace(start=-15, stop=15, num=4, base=2)
    gamma_range = np.append(gamma_range, [2**30, 2**(-30)])
    gamma_grid = dict(gamma=gamma_range)
    print(gamma_grid)

    C_range = np.logspace(start=-15, stop=15, num=4, base=2)
    #C_range = np.append(C_range, [2**20, 2**-20])
    C_grid = dict(C=C_range)
    print(C_grid)

    plt.figure(figsize=(10, 7))
    h = .2  # mesh step size
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    grid = GridSearchCV(SVC(kernel="rbf"), param_grid=gamma_grid, cv=10)
    grid.fit(X, y)
    grid_scores = grid.grid_scores_
    print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
    print(grid_scores)

    for i, score in enumerate(grid_scores):
        gamma = score.parameters['gamma']
        clf = SVC(kernel="rbf", gamma=gamma)
        clf.fit(X, y)
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        y_predicted = clf.predict(X)
        print(score.cv_validation_scores)
        print("(%d) γ=2^%d, C=%s, CV-Score = %.3f, accuracy=%.2f, AUC = %.3f" % (i+1, np.log2(gamma), "Default", score.mean_validation_score, accuracy_score(y, y_predicted), roc_auc_score(y, y_predicted)))

        # visualize decision function for these parameters
        plt.subplot(3, 4, i+1)
        plt.title("(%d) γ=2^%d C=%s CV-Score = %.3f AUC = %.3f" % (i+1, np.log2(gamma), "Default", score.mean_validation_score, roc_auc_score(y, y_predicted)), size='medium')
        # visualize parameter's effect on decision function
        plt.pcolormesh(xx, yy, -Z, cmap=plt.cm.RdBu)
        plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdBu_r)
        plt.xticks(())
        plt.yticks(())
        plt.axis('tight')
Best Answer
Reviewing your code, there are a couple of things you might consider trying.

sklearn will use a default value of C = 1. That does not necessarily mean you are overfitting (it depends on your data), but it could. You create a dictionary C_grid, but you never pass it to GridSearchCV; you should combine your two dictionaries C_grid and gamma_grid into one and pass that as param_grid.

In conclusion, I would suggest that (1) you do not use the default C value but search over a grid of C values, and (2) when looping over the different gammas to calculate AUC, you use the best C value found.
Hope it helps.