Solved – Low classification accuracy, what to do next

classification, feature selection, random forest, svm

So, I'm a newbie in the ML field and I'm trying to do some classification. My goal is to predict the outcome of a sporting event. I've gathered some historical data and am now trying to train a classifier.
I have around 1200 samples. I split off 20% of them as a test set and ran the rest through a grid search (with cross-validation) over different classifiers. So far I've tried SVMs with linear, RBF and polynomial kernels, and Random Forests.
Unfortunately, I cannot get accuracy significantly above 0.5 (the same as choosing a class at random). Does this mean I just can't predict the outcome of such a complex event? Or can I get at least 0.7-0.8 accuracy? If that's feasible, what should I look into next?

  • Get more data? (I can enlarge the dataset up to 5 times)
  • Try different classifiers? (Logistic regression, kNN, etc.)
  • Reevaluate my feature set? Are there any ML tools for analyzing which features are useful and which aren't? Maybe I should reduce my feature set (currently I have 12 features)?

Best Answer

First of all, if your classifier doesn't do better than a random choice, there is a risk that there simply is no connection between the features and the class. A good question to ask yourself in such a position is whether you or a domain expert could infer the class (with an accuracy greater than that of a random classifier) from the given features. If not, then getting more data rows or changing the classifier won't help. What you need to do is get more data using different features.
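Before concluding anything, it can help to check how far from 0.5 an accuracy has to be on a test set of this size to count as "better than random". The following is only a rough sketch using a normal approximation to the binomial. It assumes two balanced classes (so chance is 0.5), the function name `accuracy_vs_chance` is hypothetical, and the count of 130 correct predictions is invented for illustration; only the 240 test samples (20% of 1200) come from the question.

```python
import math

def accuracy_vs_chance(n_correct, n_total, chance=0.5):
    """Normal-approximation z-score and 95% interval for an observed accuracy."""
    acc = n_correct / n_total
    # Standard error under the null hypothesis that the classifier guesses at random.
    se_null = math.sqrt(chance * (1 - chance) / n_total)
    z = (acc - chance) / se_null
    # Standard error around the observed accuracy, used for the interval.
    se_obs = math.sqrt(acc * (1 - acc) / n_total)
    return acc, z, (acc - 1.96 * se_obs, acc + 1.96 * se_obs)

# 240 test samples = 20% of 1200; the 130 correct predictions are made up.
acc, z, (low, high) = accuracy_vs_chance(130, 240)
print(f"accuracy={acc:.3f}  z={z:.2f}  95% CI=({low:.3f}, {high:.3f})")
```

If the interval still contains the chance level, the test set alone cannot distinguish the classifier from random guessing, which is consistent with the concern above.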

If, on the other hand, you think the information needed to infer the class is already in the features, you should check whether your classifier suffers from a high bias or a high variance problem.

To do this, plot the validation error and the training set error as a function of the number of training examples.
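One possible way to get the numbers for such a plot is scikit-learn's `learning_curve`. This is a minimal sketch, not the asker's actual setup: the data is a synthetic stand-in generated with `make_classification` (1200 samples and 12 features to mirror the question), and the RBF `SVC` with `C=1.0` is just an assumed example estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in for the real data: 1200 samples, 12 features.
X, y = make_classification(
    n_samples=1200, n_features=12, n_informative=4, random_state=0
)

# Fit on growing subsets of the training folds, scoring with 5-fold CV.
train_sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="rbf", C=1.0),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Convert accuracy to error and average over the CV folds.
train_err = 1.0 - train_scores.mean(axis=1)
val_err = 1.0 - val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_err, val_err):
    print(f"n={int(n):4d}  train error={tr:.3f}  validation error={va:.3f}")
```

The two error columns are the two lines to plot against `train_sizes`, with whatever plotting library you prefer.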

If the two curves converge to the same value and are close together at the end, then your classifier has high bias, and adding more data won't help. A good idea in this case is either to switch to a classifier with higher variance (a more flexible model), or simply to reduce the regularization strength of your current one.

If, on the other hand, the curves are quite far apart, with a low training set error but a high validation error, then your classifier has too much variance. In this case getting more data is very likely to help. If the variance is still too high after getting more data, you can increase the regularization strength.
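To see how the regularization parameter moves a model between these two regimes, one option is scikit-learn's `validation_curve`. Again this is only a sketch on synthetic stand-in data, with an RBF `SVC` as an assumed estimator. Note that in `SVC` the parameter `C` is the inverse of regularization strength, so a small `C` means strong regularization and a large `C` means weak regularization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic stand-in for the real data: 1200 samples, 12 features.
X, y = make_classification(
    n_samples=1200, n_features=12, n_informative=4, random_state=0
)

# Small C = strong regularization, large C = weak regularization.
C_values = np.logspace(-2, 2, 5)
train_scores, val_scores = validation_curve(
    SVC(kernel="rbf"),
    X,
    y,
    param_name="C",
    param_range=C_values,
    cv=5,
    scoring="accuracy",
)

# Average over the CV folds and report error instead of accuracy.
train_err = 1.0 - train_scores.mean(axis=1)
val_err = 1.0 - val_scores.mean(axis=1)

for C, tr, va in zip(C_values, train_err, val_err):
    print(f"C={C:8.2f}  train error={tr:.3f}  validation error={va:.3f}")
```

A growing gap between training and validation error as regularization weakens points to variance, while both errors staying high under strong regularization points to bias.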

These are the general rules I would use when faced with a problem like yours.

Cheers.
