Solved – Real World Challenge: Large difference between training and testing set accuracy

cross-validation, machine-learning

I have a classification dataset of ~100,000 rows and ~200 features. Within the dataset my predictor variable (Y) is an integer value between 0 and 55, so I am trying to predict 1 of 56 possible classes. I split my data into an 80/20% training/testing split and performed an extensive 10-fold cross-validation exercise to tune the parameters and fit the final model. I end up with a very high accuracy (and F1) score on the training set (~90%) but a very low accuracy (and F1) score on the testing set (~10%).

EDIT 1: The training and testing sets are split randomly. About 150 of the features are binary (0 or 1) and the others are continuous values, which I center and scale to lie between 0 and 1.

I have tried numerous learning algorithms (SVM, NN, logistic regression, PCA + SVM) and believe that the 10-fold CV should have reduced overfitting as much as possible. However, nothing I try seems to yield any meaningfully different results.
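For reference, here is a minimal sketch of the kind of pipeline I am describing (scikit-learn is assumed; the estimator, parameter grid, and random data below are placeholders rather than my exact setup):

```python
# Toy stand-in pipeline: 80/20 split, min-max scaling, SVM, 10-fold CV grid search.
# scikit-learn is assumed; the estimator, grid, and synthetic data are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((2000, 200))            # stand-in for the real ~100,000 x 200 matrix
y = rng.integers(0, 56, size=2000)     # stand-in for the 56 integer class labels

# 80/20 split, then 10-fold CV on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = Pipeline([("scale", MinMaxScaler()), ("svm", SVC())])
search = GridSearchCV(model, param_grid={"svm__C": [0.1, 1, 10]},
                      cv=10, scoring="f1_macro")
search.fit(X_train, y_train)

print("CV F1 (training):", search.best_score_)
print("F1 (testing):    ", search.score(X_test, y_test))
```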

Can anyone suggest new ways to increase accuracy on the testing set?

Caveats:

1) This is real-world data, so getting more of it is very expensive and time-consuming.

2) We need all 56 classes, so we cannot simply eliminate any.

Thanks and any suggestions are appreciated.

Best Answer

There may be many reasons for this. Below is a non-exhaustive list of bullet points; i.e. me thinking out loud.

  • Is your predictor variable (with the class labels) ordinal or nominal? If it is the former, how do you calculate accuracy? e.g. do you incorporate the fact that $Y_{predict} = 50$ is a better prediction than $Y_{predict} = 10$ when $Y_{real} = 55$? (See the first sketch after this list.)

  • What are the label distributions for the 56 classes in the test and training sets? I understand that viewing and analysing a 56x56 confusion matrix is painful, but the answer to your question is most likely in the analyses of the confusion matrices for the training and test sets. By collapsing the confusion matrices, you can get proportions of specific agreements, etc. per class label, and then combine this information with the class distributions of your training and test sets (see the second sketch after this list). To sum up, accuracy on its own gives you a very limited view of the whole picture.

  • Related to the previous point: if you have selected the test data without shuffling the row indices, you may be suffering from temporal trends; e.g. if your test data is the most recent 20%, the class distributions for the data collected during that period may be different.

  • Also, is your data incomplete? If so, how do the ratios of incompleteness for the test and training sets compare (see the last sketch below)? The more incomplete your subset, the less predictive information you have.
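
On the first point, here is a minimal sketch of how an ordinal-aware view can differ from plain accuracy (scikit-learn is assumed; the label arrays are made up purely for illustration):

```python
# Plain accuracy treats every wrong label as equally wrong; an ordinal view does not.
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

y_real = np.array([55, 55, 10, 3])
y_pred = np.array([50, 10, 10, 4])

print("accuracy:", accuracy_score(y_real, y_pred))          # 50 and 10 count the same against 55
print("MAE:     ", mean_absolute_error(y_real, y_pred))     # penalises 10 far more than 50
print("within 5:", np.mean(np.abs(y_real - y_pred) <= 5))   # a 'close enough' accuracy
```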
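On the second point, a sketch of collapsing the 56x56 confusion matrix into per-class recall and lining it up against the class distributions of the two sets (scikit-learn and pandas are assumed; the label arrays are random placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
y_train = rng.integers(0, 56, 8000)   # stand-in training labels
y_test  = rng.integers(0, 56, 2000)   # stand-in test labels
y_pred  = rng.integers(0, 56, 2000)   # stand-in model predictions on the test set

# Rows of the confusion matrix are the true classes; diagonal / row sum = per-class recall.
cm = confusion_matrix(y_test, y_pred, labels=np.arange(56))
per_class_recall = np.diag(cm) / cm.sum(axis=1).clip(min=1)

summary = pd.DataFrame({
    "train_freq":  pd.Series(y_train).value_counts(normalize=True).sort_index(),
    "test_freq":   pd.Series(y_test).value_counts(normalize=True).sort_index(),
    "test_recall": per_class_recall,
})
print(summary.head(10))   # look for classes that are rare in training but frequent in testing
```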
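And on the last point, a quick way to compare how incomplete the two subsets are (pandas is assumed; the feature tables below are placeholders with artificially injected missing values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Placeholder feature tables with missing values encoded as NaN.
X_train = pd.DataFrame(rng.random((8000, 200))).mask(rng.random((8000, 200)) < 0.05)
X_test  = pd.DataFrame(rng.random((2000, 200))).mask(rng.random((2000, 200)) < 0.20)

# Overall and per-feature missingness rates for each subset.
print("train missing rate:", X_train.isna().mean().mean())
print("test missing rate: ", X_test.isna().mean().mean())
gap = (X_test.isna().mean() - X_train.isna().mean()).sort_values(ascending=False)
print(gap.head(5))   # features whose missingness differs most between the two sets
```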