Solved – Does dataset size influence a machine learning algorithm?

Tags: large data, machine learning

So, imagine having access to enough data (millions of datapoints for training and testing) of sufficient quality. Please ignore concept drift for now and assume the data is static and does not change over time. Does it even make sense to use all of that data, in terms of the quality of the model?

Brain and Webb (http://www.csse.monash.edu.au/~webb/Files/BrainWebb99.pdf) published results from experiments with different dataset sizes. The algorithms they tested become reasonably stable after training on 16,000 or 32,000 datapoints. However, since we now live in a big-data world with access to datasets of millions of points, the paper is still relevant but badly outdated.
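To make the kind of experiment I mean concrete, here is a minimal sketch (scikit-learn and a synthetic dataset stand in for a real setup): train the same model on increasingly large subsets and check where the held-out accuracy stops improving.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in for "millions of datapoints"; swap in real data here.
X, y = make_classification(n_samples=200_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train on increasingly large prefixes and watch where accuracy flattens out.
for n in [1_000, 4_000, 16_000, 32_000, 64_000, len(X_train)]:
    model = GaussianNB().fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"n={n:>7}  test accuracy={acc:.4f}")
```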

Is there any more recent research on the impact of dataset size on learning algorithms (Naive Bayes, decision trees, SVMs, neural networks, etc.)?

When does a learning algorithm converge to a stable model beyond which more data no longer improves quality?
Can it happen after 50,000 datapoints, or maybe after 200,000 or only after 1,000,000?
Is there a rule of thumb?
Or is there perhaps no point at which an algorithm converges to a stable model, to a certain equilibrium?
Why am I asking this? Imagine a system with limited storage and a huge number of unique models (thousands of models, each with its own dataset) and no way of increasing the storage. So limiting the size of each dataset is important.

Any thoughts or research on this?

Best Answer

You are referring to what is known as sample complexity in the PAC learning framework. There has been a significant amount of research in this area. In summary, in most real-world cases you never know the true sample complexity for a given dataset; however, you can bound it. The bounds are typically very loose and usually convey nothing more than the order of the number of examples required to reach a particular error with a particular probability.
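As one concrete, textbook example (not tied to any particular algorithm above): for a finite hypothesis class H and a consistent learner, a classic PAC bound says roughly m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice to reach error below epsilon with probability at least 1 - delta. A small sketch of how such a bound turns epsilon and delta into a sample count:

```python
import math

def pac_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Examples sufficient for a consistent learner over a finite hypothesis
    class to reach error < epsilon with probability >= 1 - delta:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)

# e.g. |H| = 2**20 hypotheses, 1% error, 95% confidence
print(pac_sample_bound(2**20, epsilon=0.01, delta=0.05))  # 1686
```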

For instance, to reach a prediction error within epsilon with high probability (1 - delta), you need a number of samples proportional to some function of epsilon and delta. For example, if your sample complexity is O(1/epsilon), you are better off than if it were O(1/epsilon^2): to reach a 1% error rate, in the former case you need on the order of 100 examples and in the latter on the order of 10,000. But remember, these are still O(.) bounds and not exact numbers.
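To make that arithmetic explicit (epsilon = 0.01, and these are only order-of-magnitude figures):

```python
epsilon = 0.01  # target error rate of 1%
print(f"O(1/epsilon)   -> on the order of {1 / epsilon:.0f} examples")       # 100
print(f"O(1/epsilon^2) -> on the order of {1 / epsilon ** 2:.0f} examples")  # 10000
```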

If you look up sample complexity bounds of particular classes of algorithms, you'd get some idea. Some lecture notes here.
