What can I use to fix SVM overfitting when all expected solutions have failed?

classification, machine-learning, overfitting, svm, unbalanced-classes

I am trying to predict whether a given day will be good for agriculture based on a small set of features. The data has a 5:1 class imbalance, with a total of 1794 samples and 15 variables, named "col_1" to "col_15". I have tried scaling my data, adding PCA before and after scaling, and feature selection based on XGBoost's feature importance. Lowering the C parameter did reduce the overfit but it also lowered my overall test score, so I decided to keep it at 1. The variables also have low pairwise correlation, ranging from -0.55 to 0.4. I'm using repeated stratified k-fold cross-validation with F1 as the evaluation metric. What can I do to decrease the overfit without also decreasing the test score?

Code:

```
# Imports added: imblearn's Pipeline is needed so that SMOTE is only
# applied to the training folds during cross-validation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# drop columns previously selected by recursive feature elimination
X = df.drop(['target', 'id', 'col_5', 'col_9', 'col_14'], axis=1)
y = df['target']

svc = SVC(C=1, gamma=0.1)
sc = StandardScaler()
over = SMOTE(sampling_strategy=0.67)
steps = [('pca', PCA(n_components=10)), ('scaler', sc), ('over', over), ('model', svc)]
pipe = Pipeline(steps=steps)

cv = RepeatedStratifiedKFold()
scores = cross_validate(pipe, X, y, scoring='f1', cv=cv, return_train_score=True)

print('train:')
print('mean:', np.mean(scores['train_score']) * 100)
print('std:', np.std(scores['train_score']) * 100)
print('test:')
print('mean:', np.mean(scores['test_score']) * 100)
print('std:', np.std(scores['test_score']) * 100)
print('difference between means:',
      np.mean(scores['train_score']) * 100 - np.mean(scores['test_score']) * 100)
```

Evaluation results (train and test F1 scores from the cross-validation): https://i.stack.imgur.com/1JIBb.png

Best Answer

"Lowering the C parameter did reduce the overfit but it also lowered my overall test score, so i decided to keep it at 1. "

I suspect this is the start of the problem. Tuning the C parameter is at the heart of best practice in the application of SVMs, and it needs to be tuned very carefully over an appropriately wide range of values. It is best to tune its value on a logarithmic grid (so that the search covers a wide range of magnitudes). It is likely that you just haven't found the right value yet.

SVMs are based on the idea of structural risk minimisation (SRM), where you have a sequence of hypothesis classes of increasing complexity and pick the minimum level of complexity required to solve the problem. For "soft-margin" SVMs, this set of nested models is created by varying C (if you increase C, the model can still form all of the solutions it could before, but also some new ones that require slightly larger dual parameter values). So if you are not tuning C carefully, you are not using SRM and hence not applying the SVM according to best practice.

The gamma parameter needs to be tuned in conjunction with C, as both have a regularising effect.
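To make "tuning on a logarithmic grid" concrete, here is a minimal sketch of a joint search over C and gamma, assuming the X and y defined in the question; the step names, grid bounds and fold counts are arbitrary illustrative choices:

```
# Sketch: joint grid search over C and gamma on logarithmic grids.
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scaler', StandardScaler()), ('model', SVC())])

param_grid = {
    'model__C': np.logspace(-3, 3, 13),      # 10^-3 ... 10^3
    'model__gamma': np.logspace(-4, 1, 11),  # 10^-4 ... 10^1
}

search = GridSearchCV(pipe, param_grid, scoring='f1',
                      cv=RepeatedStratifiedKFold(n_splits=5, n_repeats=3),
                      n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

If the best values land at the edge of the grid, widen the grid and search again.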

"feature selection based off XGBoost's feature importance"

Feature selection often makes the performance of SVMs worse rather than better. The SVM is an approximate implementation of a bound on generalisation error that is independent of the dimensionality of the feature space, and there are rarely good theoretical bounds for feature selection. Most regularised models do not require feature selection, provided the regularisation parameter (C in this case) is well tuned.

over = SMOTE(sampling_strategy=0.67)

I personally would avoid SMOTE, at least until you have tried using different values of C for the two classes, which effectively weights their importance differently. This is likely to be more efficient and won't invalidate the theoretical underpinnings of the SVM. Way back in the mists of ancient history, I wrote a paper (preprint) about this, but there were many others around that time.
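In scikit-learn this per-class weighting is exposed through the class_weight argument of SVC, which scales C separately for each class. A minimal sketch (the explicit labels and the 1:5 ratio below are illustrative choices, not taken from the question's data):

```
# Sketch: class-specific misclassification costs instead of SMOTE.
from sklearn.svm import SVC

# 'balanced' sets each class weight inversely proportional to its
# frequency in the training data.
svc_balanced = SVC(C=1, gamma=0.1, class_weight='balanced')

# Or set the weights explicitly (labels 0/1 and the 1:5 ratio are
# made-up values for illustration).
svc_manual = SVC(C=1, gamma=0.1, class_weight={0: 1, 1: 5})
```

The class weights (or the ratio between them) can also be added to the grid search sketched above.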

"What can I do to decrease the overfit without also decreasing the test score?"

How are you defining/detecting over-fitting? If you are peeking at the test score in order to tune the architecture and hyper-parameters of the model, then that is exercising "researcher degrees of freedom", and it is easy to over-fit the test score that way. AutoML (using the computer to select the architecture and hyper-parameters) is a useful addition to the machine learning toolbox. "CyborgML", where the operator becomes part of the model selection/tuning loop, is rather less so, because the steps taken by the researcher are rarely recorded (and so are not reproducible) and are not taken into account when trying to avoid "overfitting in model selection", which can lead to biased performance estimation (paper). In short, don't look at the test scores until you have selected your final modelling approach.
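One way to keep the test folds out of the tuning loop is nested cross-validation: an inner search selects the hyper-parameters, and the outer folds are only ever used for the final performance estimate. A sketch, reusing the pipe and param_grid from the grid-search sketch above (so the same caveats apply):

```
# Sketch: nested cross-validation. The inner GridSearchCV tunes C and gamma;
# the outer folds are never used for tuning, so the resulting score is a
# far less optimistically biased estimate of the tuned model's performance.
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

inner_search = GridSearchCV(pipe, param_grid, scoring='f1', cv=inner_cv, n_jobs=-1)
nested_scores = cross_val_score(inner_search, X, y, scoring='f1', cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
```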
