Solved – sklearn – overfitting problem

Tags: machine-learning, overfitting, python, scikit-learn, svm

I'm looking for recommendations on the best way forward for my current machine learning problem.

The outline of the problem and what I've done is as follows:

  • I have 900+ trials of EEG data, where each trial is 1 second long. The ground truth is known for each trial and classifies state 0 vs. state 1 (roughly a 40-60% split)
  • Each trial goes through preprocessing where I filter and extract power of certain frequency bands, and these make up a set of features (feature matrix: 913×32)
  • Then I use sklearn to train the model. The data is split into train and test sets (test size 0.2), and the classifier is an SVC with an RBF kernel, C = 1, gamma = 1 (I've tried a number of different values)
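A minimal sketch of this setup, with seeded random data standing in for the real 913×32 feature matrix and labels (everything below the imports is a placeholder, not the actual pipeline):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(913, 32))     # placeholder for the 913x32 EEG feature matrix
y = rng.integers(0, 2, size=913)   # placeholder 0/1 state labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="rbf", C=1, gamma=1).fit(X_train, y_train)
print("train:", clf.score(X_train, y_train))  # near-perfect even on random data
print("test:", clf.score(X_test, y_test))     # near chance level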

You can find a shortened version of the code here: http://pastebin.com/Xu13ciL4

My issues:

  • When I use the classifier to predict labels for my test set, every prediction is 0
  • Training accuracy is 1, while test set accuracy is around 0.56
  • my learning curve plot looks like this:

[learning curve plot]

Now, this looks like a classic case of overfitting. However, it is unlikely to be caused by a disproportionate number of features relative to samples (32 features, 900+ samples). I've tried a number of things to alleviate the problem:

  • I've tried dimensionality reduction (PCA) in case the problem is too many features for the number of samples, but the accuracy scores and learning curve plot look the same as above. If I set the number of components below 10, training accuracy begins to drop, but isn't that somewhat expected, given that you're starting to lose information?
  • I have tried normalizing and standardizing the data. Standardizing (SD = 1) does nothing to change the train or test accuracy scores. Normalizing (to 0-1) drops my training accuracy to 0.6.
  • I've tried a variety of C and gamma settings for SVC, but they don't change either score
  • Tried other estimators like GaussianNB, and even ensemble methods like AdaBoost. No change
  • Tried explicitly setting a regularization method using LinearSVC, but it didn't improve the situation
  • I tried running the same features through a neural net using Theano; training accuracy is around 0.6 and test accuracy around 0.5
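For reference, the standardizing and PCA steps above can be chained in a single sklearn Pipeline so that both are fit on the training folds only, avoiding leakage into the validation folds. The component count and SVC parameters below are illustrative assumptions, and the data is again a random stand-in:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(913, 32))    # placeholder features
y = rng.integers(0, 2, size=913)  # placeholder labels

pipe = Pipeline([
    ("scale", StandardScaler()),            # standardize each feature (SD = 1)
    ("pca", PCA(n_components=10)),          # illustrative component count
    ("svc", SVC(kernel="rbf", C=1, gamma=0.01)),
])
scores = cross_val_score(pipe, X, y, cv=5)  # 5-fold cross-validated accuracy
print(scores.mean())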

I'm happy to keep thinking about the problem but at this point I'm looking for a nudge in the right direction. Where might my problem be and what could I do to solve it?

It's entirely possible that my set of features simply doesn't distinguish between the two categories, but I'd like to try some other options before jumping to that conclusion. Furthermore, if my features aren't discriminative, that would explain the low test set scores, but how would I then get a perfect training set score? Is that possible?

Best Answer

The problem

The RBF kernel function for two vectors $\mathbf{u}$ and $\mathbf{v}$ looks like this: $$ \kappa(\mathbf{u},\mathbf{v}) = \exp(-\gamma \|\mathbf{u}-\mathbf{v}\|^2). $$

Essentially, your results indicate that your values for $\gamma$ are far too high. When $\gamma$ is very large, the kernel matrix essentially becomes the identity matrix: $\|\mathbf{u}-\mathbf{v}\|^2 > 0$ whenever $\mathbf{u}\neq \mathbf{v}$ and $= 0$ otherwise, so the corresponding kernel values tend to $\approx 0$ and $1$ respectively (consider the limit $\gamma \to \infty$).

This leads to an SVM model in which all training instances are support vectors, and the model fits the training data perfectly. When you then predict on a test set, all predictions are identical to the model's bias $\rho$, because the kernel computations are all essentially zero: $$ f(\mathbf{z}) = \underbrace{\sum_{i\in SV} \alpha_i y_i \kappa(\mathbf{x}_i, \mathbf{z})}_{\approx\, 0} + \rho, $$ where $\mathbf{x}_i$ is the $i$th support vector and $\alpha_i$ its corresponding dual weight.
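This collapse of the Gram matrix is easy to observe directly; the sketch below, using arbitrary random vectors, compares a small and a large $\gamma$:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 32))  # 5 arbitrary 32-dimensional vectors

K_small = rbf_kernel(X, X, gamma=1e-3)  # off-diagonal entries well above 0
K_large = rbf_kernel(X, X, gamma=10.0)  # essentially the identity matrix

print(np.round(K_large, 3))
print(np.allclose(K_large, np.eye(5), atol=1e-3))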

The solution

Your search space needs to be expanded to far lower values of $\gamma$. Typically we use exponential grids, e.g. $10^{lb} \leq \gamma \leq 10^{ub}$, where the bounds are data dependent (e.g. $[lb, ub] = [-8, 2]$).
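Such an exponential grid is easy to generate with numpy's logspace; the sketch below uses plain GridSearchCV purely to illustrate the expanded range (the bounds are assumptions, and the data is a random stand-in):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))    # random stand-in for the real features
y = rng.integers(0, 2, size=200)  # random stand-in for the labels

param_grid = {
    "C": np.logspace(-2, 3, 3),       # 10^-2 ... 10^3
    "gamma": np.logspace(-8, 2, 11),  # 10^-8 ... 10^2 in decade steps
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3).fit(X, y)
print(search.best_params_)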

I suspect you're using grid search at the moment, which is a very poor way to optimize hyperparameters because it wastes most of its time investigating hyperparameters that aren't good for your problem.

It's far better to use optimizers designed for such problems, which are available in libraries like Optunity and Hyperopt. I'm the main developer of Optunity; you can find an example that does exactly what you need (i.e., tuning a sklearn SVC) in our documentation.
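The Optunity example itself lives in that project's documentation. As a scikit-learn-only alternative in the same spirit, RandomizedSearchCV can sample $C$ and $\gamma$ from log-uniform distributions instead of exhaustively walking a grid; the bounds and iteration count below are assumptions, and the data is again a random stand-in:

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))    # random stand-in for the real features
y = rng.integers(0, 2, size=200)  # random stand-in for the labels

search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions={"C": loguniform(1e-2, 1e3),
                         "gamma": loguniform(1e-8, 1e2)},
    n_iter=30, cv=3, random_state=0,  # 30 sampled (C, gamma) pairs
).fit(X, y)
print(search.best_params_)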
