Does Machine Learning Need Data-Efficient Algorithms?

efficiency, machine-learning, neural-networks, sample-size, small-sample

Deep learning methods are often said to be very data-inefficient, requiring on the order of 100-1000 examples per class, whereas a human may need only 1-2 to reach comparable classification accuracy.

However, modern datasets are huge (or can be made huge), which raises the question of whether we really need data-efficient algorithms. Are there application areas where a data-efficient machine learning algorithm would be very useful, despite making trade-offs elsewhere, e.g. in training or inference efficiency? Would an ML algorithm that is, say, 100x more data-efficient, while being 1000x slower, be useful?

People who work on data-efficient algorithms often bring up robotics for "motivation". But even for robotics, large datasets can be collected, as is done in this data-collection factory at Google:

[Image: Google's robotic data-collection facility]

Basically, my concern is that while data-efficient algorithms exist (e.g. ILP, graphical models) and could be further improved, their practical applicability is squeezed between common tasks, where huge datasets exist, and rare ones that may not be worth automating (leave something for humans!).
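To make the sample-efficiency comparison concrete, here is a minimal sketch: it trains a simple generative classifier (Gaussian naive Bayes) and a small neural network on increasingly large slices of scikit-learn's bundled digits dataset and prints test accuracy. The dataset, model choices, and subset sizes are illustrative assumptions, not a benchmark.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Small, bundled image-classification dataset (8x8 digit images, 10 classes).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# Crude learning-curve comparison: fit both models on increasingly
# large slices of the training set and report held-out accuracy.
for n in (20, 50, 100, 500):
    X_sub, y_sub = X_train[:n], y_train[:n]
    nb = GaussianNB().fit(X_sub, y_sub)                    # simple generative model
    mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000,
                        random_state=0).fit(X_sub, y_sub)  # small neural network
    print(f"n={n:4d}  naive Bayes acc={accuracy_score(y_test, nb.predict(X_test)):.2f}  "
          f"MLP acc={accuracy_score(y_test, mlp.predict(X_test)):.2f}")
```

On very small training sets the simpler model tends to hold up better, which is the sense in which such methods are called data-efficient; with enough data the neural network typically catches up or overtakes it.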

Best Answer

You are not entirely wrong: it is often a lot easier to collect more/better data to improve a model than to squeeze minor improvements out of the algorithm.

However, in practice there are many settings where it is difficult to get a really large dataset.

Sure, it's easy to get really large datasets when you use (self-/un-)supervised approaches or when your labels are created automatically (e.g. if you are Google, whether a user clicks on a link or not). However, many practical problems rely on human experts, whose time may be expensive, to label the examples. When any human can do the job (e.g. labelling an image as dog, cat, or something else for ImageNet), this can be scaled to millions of images, but when you pay physicians to classify medical images, tens of thousands (or perhaps 100,000-ish) labelled images is already a pretty large dataset. The same applies if you need to run a chemical experiment to obtain each label.

Additionally, there are cases where the number of possible real-world examples is naturally limited (e.g. training data for forecasting the winners of US presidential elections, or for predicting volcanic eruptions from seismic data); these are things for which, so far, we can only ever have so much data.
