How many data points are enough for a regression model to predict with reasonable (say 88%-92%) accuracy?

Tags: accuracy, dataset, machine-learning, regression, sample-size

Is there a number of data points we can land on for our regression model to predict with high accuracy? (The accuracy metrics I have in mind are RMSE and R-squared.) By high accuracy I mean something above 88% to 90%, with a 95% confidence interval (I am not glued to any particular number).

In my current setup we have to run some tests, and the issue is that we cannot collect a very large dataset because a single test takes a long time to run. We have concluded that we will run around 80 tests (based on the average time each test takes, plus the time needed to change the configuration, etc.).

However, I am not convinced that this many data points will give us a sufficiently accurate model.

Also, we are running each test only once. My concern is that any single outcome may be a fluke. Statistically speaking, how many runs per test would be needed to rule out a fluke?

Best Answer

We can't tell you. It depends on your situation, and on how easy prediction is in that situation.

How many coin tosses do you need to observe before you can predict the next one with 90% accuracy?
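The point is easy to check by simulation. A minimal sketch in Python (assuming NumPy): even after "learning" the best possible rule from a million fair tosses, accuracy on new tosses stays at chance, because the outcome carries no learnable signal.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (10, 1_000, 100_000, 1_000_000):
    tosses = rng.integers(0, 2, size=n)      # training "data": 0 = tails, 1 = heads
    rule = int(tosses.mean() >= 0.5)         # best rule learnable from the sample
    new = rng.integers(0, 2, size=100_000)   # fresh tosses to predict
    print(f"n = {n:>9,}: accuracy on new tosses = {(new == rule).mean():.3f}")
```

No sample size moves the accuracy off ~0.5; more data cannot buy predictability that isn't there.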

Related: How to know that your machine learning problem is hopeless?
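One practical way to get an empirical handle on "how easy prediction is in your situation" is a learning curve: fit the model on growing subsets of whatever pilot data you have and watch how the cross-validated score moves. This is a suggested probe, not a guarantee. A minimal sketch with scikit-learn's learning_curve; the data below are simulated stand-ins (hypothetical), so substitute your own ~80 measurements:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Hypothetical pilot data: X = test configurations, y = measured outcome.
# Simulated here only so the sketch runs end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = X @ np.array([1.5, -2.0, 0.7, 0.0]) + rng.normal(scale=0.5, size=80)

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5, scoring="r2",
)
for n, s in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>3} training points -> mean CV R^2 = {s:.3f}")
# If the curve has flattened well before 80 points, more data buys little;
# if it is still climbing at 80, the sample is probably too small.
```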

And of course, in many situations you can predict with "better than 90% accuracy" without learning at all, namely when one outcome occurs in more than 90% of cases - then just always predict that. For instance: always classify a credit card transaction as non-fraudulent. Most CC transactions are non-fraudulent, so such a useless prediction will look very good in terms of accuracy... because accuracy is not a good evaluation measure.
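This baseline trap is easy to demonstrate with scikit-learn's DummyClassifier, which ignores the features entirely; the 2% fraud rate below is made up for illustration:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Simulated imbalanced labels: ~2% "fraud" (1), ~98% "non-fraud" (0).
rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.02).astype(int)
X = rng.normal(size=(100_000, 3))  # features are never used by the dummy rule

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print(f"accuracy = {clf.score(X_te, y_te):.3f}")  # ~0.98 while learning nothing
```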
