Solved – SVC doing great on validation & test data but scoring very low on the Kaggle submission

cross-validation, kaggle, machine-learning, scoring-rules, svm

First of all, this is my first machine learning project after taking Andrew Ng's course, so please bear with me.

I'm working on the most famous dataset, the Titanic data.

First, I split the dataset into a training and a testing set:

import pandas as pd
from sklearn.model_selection import train_test_split

# stratified 80/20 split of the labelled Kaggle training frame
training, testing = train_test_split(train, test_size=0.2, stratify=train['Survived'], random_state=0)

X_train = training.drop(['Survived'], axis=1)
y_train = training['Survived']

X_test = testing.drop(['Survived'], axis=1)
y_test = testing['Survived']

The default SVC works poorly on this dataset because of overfitting (90% accuracy on the training set but only 60% on the CV set).

So I use nested CV (GridSearchCV + cross_val_score) to find good hyperparameters: C and gamma. Note that I use the default RBF kernel.

First, I tried smaller values for C (larger margin) and smaller values for gamma (smoother decision boundary), since in theory both should reduce overfitting.
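
To make that effect visible without waiting for the full grid search, a validation curve is much cheaper. This is only a sketch, using sklearn's validation_curve on the X_train/y_train defined above, with an illustrative gamma range:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

gamma_range = np.logspace(-7, -1, 7)
train_scores, cv_scores = validation_curve(
    SVC(C=1.0), X_train, y_train,
    param_name='gamma', param_range=gamma_range,
    cv=5, scoring='accuracy')

# mean train vs. CV accuracy per gamma; a large gap means overfitting,
# and it should shrink as gamma decreases
for g, tr, cv in zip(gamma_range, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print('gamma=%g : train=%.3f, cv=%.3f' % (g, tr, cv))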

However, I noticed that GridSearchCV tended to pick the largest C and the smallest gamma in the grid as the best parameters. This is my complete code (after data cleansing and feature engineering):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

parameters = {
                'C': [2000, 2500, 3000], # the smaller C is, the larger the margin
                'gamma': [0.000001, 0.000003, 0.000006],
                'random_state': [0]
             }

clf = SVC()

grid_obj = GridSearchCV(clf, parameters, cv=5, scoring='accuracy')
grid_obj = grid_obj.fit(X_train, y_train) # use this one?

# 10-fold outer CV around the grid search, i.e. nested CV
scores_log = cross_val_score(grid_obj, X_train, y_train, cv=10)
print('Final CV accuracy: %.3f +/- %.3f' % (np.mean(scores_log), np.std(scores_log)))

print(grid_obj.best_estimator_)
print('Best GridSearchCV score: ' + str(grid_obj.best_score_))

# Set clf to the best combination of parameters
clf = grid_obj.best_estimator_

# Refit on the training data (with the default refit=True, best_estimator_ is already fitted, so this is redundant but harmless)
clf.fit(X_train, y_train)

score_train = clf.score(X_train, y_train)
print('Training accuracy: ' + str(score_train))

score_test = clf.score(X_test, y_test)
print('Test accuracy: ' + str(score_test))

SVC is slow (and my laptop is not that great, haha). Almost two hours passed, and I arrived at a very extreme set of parameters. I took those (supposedly) best parameters and trained the classifier on all of my data (including the test set):

X = pd.concat([X_train, X_test])
y = pd.concat([y_train, y_test])

parameters = {'C': 3000,
              'gamma': 0.000006,
              'random_state': 0}

clf = SVC(**parameters)
clf.fit(X, y) # fit on all labelled data, as described above
score = clf.score(X, y)

print('Accuracy : ' + str(score))

y_pred = clf.predict(test)

submit_kaggle(test.loc[:,'PassengerId'], y_pred)
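
(submit_kaggle is my own helper; here is a minimal sketch of what it does, assuming the standard Titanic submission format of a two-column PassengerId/Survived CSV:)

def submit_kaggle(passenger_ids, predictions):
    # Kaggle's Titanic competition expects a CSV with exactly these two columns
    pd.DataFrame({'PassengerId': passenger_ids,
                  'Survived': predictions}).to_csv('submission.csv', index=False)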

With those best parameters, the SVC scored about 80% on the training, CV, and test data. I believed I had reduced the overfitting, because of the higher test score and lower training score (compared to the 90% training accuracy with the default parameters).

Finally, I submitted the prediction to Kaggle… and I got a 51% score.

What confuses me the most is the gap between the test score and the Kaggle score.

I think I did something wrong somewhere (probably by letting my classifier train on the testing set).
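
One hypothetical sanity check (not part of my original code) is to compare the engineered columns of the training frame against the cleaned Kaggle test frame before predicting; any mismatch means the two cleaning paths diverged:

# hypothetical check: the model must see the same engineered columns at
# prediction time as it saw during fit
missing = set(X_train.columns) - set(test.columns)
extra = set(test.columns) - set(X_train.columns)
print('columns missing from test:', missing)
print('extra columns in test:', extra)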

Please take a look at my code, and let me know if you want to see more of it (the data cleansing and feature engineering part).

Thanks in advance

Note: I have tried Logistic Regression and a Decision Tree using the same structure as the code above, and they work as expected (the test-set accuracy is similar to the Kaggle score).

Best Answer

Here are a few tips you may want to try:

  1. Standardize your dataset (see the pipeline sketch at the end of this answer)
  2. Perform some EDA to get an idea of the underlying patterns in the data
  3. Do some visualization to see data regularities
  4. Do dimensionality reduction
  5. Now try your classifier again
  6. If RBF gives poor results, try a newer SVM kernel method called "CJSD kernels", which gives improved classification accuracy over RBF. Here is a description of CJSDs: https://www.quora.com/Is-the-Jensen-Shannon-Divergence-limited-in-0-1-Given-two-models-is-it-correct-that-the-larger-JSD-is-the-more-similar-they-are-to-each-other

and here are the references for CJSD Kernels:

https://ieeexplore.ieee.org/document/7424294/
https://ieeexplore.ieee.org/document/7796903/
https://link.springer.com/article/10.1007%2Fs41060-017-0054-1

The following paper describes the limitations of traditional SVM kernel methods, and why they led to the development of CJSD-based kernels:

"Investigating Manifold Neighborhood size for Nonlinear analysis of LIBS Amino Acid Spectra"

See if you can get hold of the following PhD dissertation:

"Finding a Suitable Model For Novel Data Using Range Transformation"

as the above work is a subset of this broader research work.
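
On tip 1: with an RBF kernel, unscaled features are a common reason for an extreme-looking C/gamma grid like the one above. Here is a minimal sketch, assuming the asker's X_train/y_train and using illustrative parameter values, that puts StandardScaler inside a Pipeline so that each CV fold is scaled using statistics from its own training portion only:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# scaling lives inside the pipeline, so it is re-fit within every CV fold
pipe = Pipeline([('scale', StandardScaler()),
                 ('svc', SVC(kernel='rbf'))])

param_grid = {'svc__C': [0.1, 1, 10, 100],        # ordinary ranges once features are scaled
              'svc__gamma': ['scale', 0.01, 0.1]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

After scaling, the best C and gamma usually land in a far less extreme range than C=3000 and gamma=0.000006.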
