Solved – Expected best performance possible on a data set

machine learning

Say I have a simple machine learning problem like a classification. With some benchmarks in vision or audio recognition, I, as a human, are a very good classifier. I therefore have an intuition on how good a classifier can get.

But with lots of data one point is that I do not know how good the classifier I train is possible to get. This is data where I personally am not a very good classifier (say, classify the mood of a person from EEG data). It is not really possible to get an intuition on how hard my problem is.

Now, if I am presented with a machine learning problem, I would like to find out how good I can get. Are there any principled approaches to this? How would you do this?

Visualize data? Start with simple models? Start with very complex models and see if I can overfit? What are you looking for if you want to answer this question? When do you stop trying?

Best Answer

I do not know whether this counts as an answer ...

This is the one problem which keeps you up at night. Do you can build a better model ? Phd-comics sums it up nicely (I don't know whether I am allowed to upload the comics, so I just linked them)

From my personal experience, gained by participating in Machine Learning competitions, here is a rule of the thumb.

Imagine you are given a classification task. Sit down, brainstorm an hour or less how you'd approach the problem and check out the state of the art in this area. Build a model based on this research, preferably one which is known to be stable without too much parameter tweaking. The resulting performance will be roughly around 80% of the maximum achievable performance.

This rule is based on the so called Pareto principle, which also applies to optimization. Given a problem, you can create a solution which performs reasonable well fast, but from that point the ratio of improvement to time effort drops rapidly.

Some final words: When I read papers about new classification algorithms, I expect the authors to compare their new breed with such "pareto-optimized" approaches, i.e. I expect them to spend a reasonable amount of time to make the state of the art work (some require more or less parameter optimization). Unfortunately, many don't do that.

Related Question