Solved – SVC doing great on validation & test data but scoring very low on the Kaggle submission

cross-validation, kaggle, machine-learning, scoring-rules, svm

First of all, this is my first machine learning project after taking Andrew Ng's course, so please bear with me.

I'm working on the most famous dataset, the Titanic data.

First, I split the dataset into a training and a testing set:

import pandas as pd
from sklearn.model_selection import train_test_split

# stratified 80/20 split of the labelled Kaggle training frame
training, testing = train_test_split(train, test_size=0.2, stratify=train['Survived'], random_state=0)

X_train = training.drop(['Survived'], axis=1)
y_train = training['Survived']

X_test = testing.drop(['Survived'], axis=1)
y_test = testing['Survived']

The default SVC works poorly on this dataset because of overfitting (90% accuracy on the training set but only 60% on the CV set).

So I use nested CV (GridSearchCV + cross_val_score) to find good hyperparameters: C and gamma. Note that I use the default RBF kernel.

First, I tried smaller values for C (larger margin) and smaller values for gamma (smoother decision boundary), since in theory both should reduce overfitting.
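
To make that effect visible without waiting for the full grid search, a validation curve is much cheaper. This is only a sketch, using sklearn's validation_curve on the X_train/y_train defined above, with an illustrative gamma range:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

gamma_range = np.logspace(-7, -1, 7)
train_scores, cv_scores = validation_curve(
    SVC(C=1.0), X_train, y_train,
    param_name='gamma', param_range=gamma_range,
    cv=5, scoring='accuracy')

# mean train vs. CV accuracy per gamma; a large gap means overfitting,
# and it should shrink as gamma decreases
for g, tr, cv in zip(gamma_range, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print('gamma=%g : train=%.3f, cv=%.3f' % (g, tr, cv))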

However, I noticed that GridSearchCV tended to pick the largest C and the smallest gamma in the grid as the best parameters. This is my complete code (after data cleansing and feature engineering):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

parameters = {
                'C': [2000, 2500, 3000], # the smaller C is, the larger the margin
                'gamma': [0.000001, 0.000003, 0.000006],
                'random_state': [0]
             }

clf = SVC()

grid_obj = GridSearchCV(clf, parameters, cv=5, scoring='accuracy')
grid_obj = grid_obj.fit(X_train, y_train) # use this one?

# 10-fold outer CV around the grid search, i.e. nested CV
scores_log = cross_val_score(grid_obj, X_train, y_train, cv=10)
print('Final CV accuracy: %.3f +/- %.3f' % (np.mean(scores_log), np.std(scores_log)))

print(grid_obj.best_estimator_)
print('Best GridSearchCV score: ' + str(grid_obj.best_score_))

# Set clf to the best combination of parameters
clf = grid_obj.best_estimator_

# Refit on the training data (with the default refit=True, best_estimator_ is already fitted, so this is redundant but harmless)
clf.fit(X_train, y_train)

score_train = clf.score(X_train, y_train)
print('Training accuracy: ' + str(score_train))

score_test = clf.score(X_test, y_test)
print('Test accuracy: ' + str(score_test))

SVC is slow (and my laptop is not that great, haha). Almost two hours passed, and I arrived at a very extreme set of parameters. I took those (supposedly) best parameters and trained the classifier on all of my data (including the test set):

X = pd.concat([X_train, X_test])
y = pd.concat([y_train, y_test])

parameters = {'C': 3000,
              'gamma': 0.000006,
              'random_state': 0}

clf = SVC(**parameters)
clf.fit(X, y) # fit on all labelled data, as described above
score = clf.score(X, y)

print('Accuracy : ' + str(score))

y_pred = clf.predict(test)

submit_kaggle(test.loc[:,'PassengerId'], y_pred)
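
(submit_kaggle is my own helper; here is a minimal sketch of what it does, assuming the standard Titanic submission format of a two-column PassengerId/Survived CSV:)

def submit_kaggle(passenger_ids, predictions):
    # Kaggle's Titanic competition expects a CSV with exactly these two columns
    pd.DataFrame({'PassengerId': passenger_ids,
                  'Survived': predictions}).to_csv('submission.csv', index=False)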

With those best parameters, the SVC scored about 80% on the training, CV, and test data. I believed I had reduced the overfitting, because of the higher test score and lower training score (compared to the 90% training accuracy with the default parameters).

Finally, I submitted the prediction to Kaggle… and I got a 51% score.

What confuses me the most is the gap between the test score and the Kaggle score.

I think I did something wrong somewhere (probably by letting my classifier train on the testing set).
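
One hypothetical sanity check (not part of my original code) is to compare the engineered columns of the training frame against the cleaned Kaggle test frame before predicting; any mismatch means the two cleaning paths diverged:

# hypothetical check: the model must see the same engineered columns at
# prediction time as it saw during fit
missing = set(X_train.columns) - set(test.columns)
extra = set(test.columns) - set(X_train.columns)
print('columns missing from test:', missing)
print('extra columns in test:', extra)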

Please take a look at my code, and let me know if you want to see more of it (the data cleansing and feature engineering part).

Thanks in advance

Note: I have tried Logistic Regression and a Decision Tree using the same structure as the code above, and they work as expected (the test-set accuracy is similar to the Kaggle score).

Best Answer

Here are a few tips you may want to try:

  1. Standardize your dataset (see the pipeline sketch at the end of this answer)
  2. Perform some EDA to get an idea of the underlying patterns in the data
  3. Do some visualization to see data regularities
  4. Do dimensionality reduction
  5. Now try your classifier again
  6. If RBF gives poor results, try a newer SVM kernel method called "CJSD kernels", which gives improved classification accuracy over RBF. Here is a description of CJSDs: https://www.quora.com/Is-the-Jensen-Shannon-Divergence-limited-in-0-1-Given-two-models-is-it-correct-that-the-larger-JSD-is-the-more-similar-they-are-to-each-other

and here are the references for CJSD Kernels:

https://ieeexplore.ieee.org/document/7424294/
https://ieeexplore.ieee.org/document/7796903/
https://link.springer.com/article/10.1007%2Fs41060-017-0054-1

The following paper describes the limitations of traditional SVM kernel methods, and why they led to the development of CJSD-based kernels:

"Investigating Manifold Neighborhood size for Nonlinear analysis of LIBS Amino Acid Spectra"

See if you can get hold of the following PhD dissertation:

"Finding a Suitable Model For Novel Data Using Range Transformation"

as the above work is a subset of this broader research work.
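
On tip 1: with an RBF kernel, unscaled features are a common reason for an extreme-looking C/gamma grid like the one above. Here is a minimal sketch, assuming the asker's X_train/y_train and using illustrative parameter values, that puts StandardScaler inside a Pipeline so that each CV fold is scaled using statistics from its own training portion only:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# scaling lives inside the pipeline, so it is re-fit within every CV fold
pipe = Pipeline([('scale', StandardScaler()),
                 ('svc', SVC(kernel='rbf'))])

param_grid = {'svc__C': [0.1, 1, 10, 100],        # ordinary ranges once features are scaled
              'svc__gamma': ['scale', 0.01, 0.1]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

After scaling, the best C and gamma usually land in a far less extreme range than C=3000 and gamma=0.000006.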
