You are right, this is a significant problem in machine learning/statistical modelling. Essentially the only way to really solve it is to retain an independent test set, keep it held out until the study is complete, and use it only for the final validation.
However, inevitably people will look at the results on the test set and then change their model accordingly; this won't necessarily result in an improvement in generalisation performance, as the difference in performance between models may be largely due to the particular sample of test data we happen to have. In that case, in making a choice we are effectively over-fitting the test error.
The way to limit this is to make the variance of the test error as small as possible (i.e. the variability in test error we would see if we used different samples of data as the test set, drawn from the same underlying distribution). This is most easily achieved by using a large test set where possible, or by e.g. bootstrapping or cross-validation if there isn't much data available.
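As a rough illustration of gauging that variability, here is a minimal sketch (not from my original work; the data, model and libraries are just placeholders) that bootstrap-resamples a held-out test set to see how much the test error estimate would move around with a different test sample:

```python
# Sketch: estimate the variability of a test-set error by bootstrap-resampling
# the held-out test set (synthetic data and a simple model, purely illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=200, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
errors = (model.predict(X_te) != y_te).astype(float)   # per-example 0/1 losses

rng = np.random.default_rng(0)
boot = [errors[rng.integers(0, len(errors), len(errors))].mean()
        for _ in range(2000)]
print(f"test error: {errors.mean():.3f} (bootstrap SD: {np.std(boot):.3f})")
```

If the bootstrap standard deviation is of the same order as the differences between the models you are comparing, the comparison tells you very little.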
I have found that this sort of over-fitting in model selection is a lot more troublesome than is generally appreciated, especially with regard to performance estimation, see
G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection and subsequent selection bias in performance evaluation", Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.
This sort of problem especially affects the use of benchmark datasets, which have been used in many studies; each new study is implicitly affected by the results of earlier studies, so the observed performance is likely to be an over-optimistic estimate of the true performance of the method. The way I try to get around this is to look at many datasets (so the method isn't tuned to one specific dataset) and also to use multiple random test/training splits for performance estimation (to reduce the variance of the estimate), as sketched below. However, the results still need the caveat that these benchmarks have been over-fitted.
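A minimal sketch of the repeated-splits idea (again with placeholder data and a placeholder model; scikit-learn's ShuffleSplit is one way to do it):

```python
# Sketch: reduce the variance of a performance estimate by averaging over
# many random train/test splits (illustrative data and model only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=1)
splits = ShuffleSplit(n_splits=30, test_size=0.2, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=splits)
print(f"mean accuracy {scores.mean():.3f} +/- {scores.std():.3f} over 30 splits")
```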
Another example where this occurs is in machine learning competitions with a leaderboard based on a validation set. Inevitably some competitors keep tinkering with their model to climb the leaderboard, but then end up towards the bottom of the final rankings. The reason for this is that their multiple choices have over-fitted the validation set (effectively learning the random variations in the small validation set).
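You can see this effect in a toy simulation (all numbers made up): generate many "submissions" that are pure noise, pick the one that scores best on a small validation set, and then score it on a large fresh test set.

```python
# Toy simulation of leaderboard over-fitting: the best of many random
# submissions looks good on a small validation set, but a fresh test set
# reveals chance-level performance (all values illustrative).
import numpy as np

rng = np.random.default_rng(42)
n_val, n_test, n_submissions = 100, 100_000, 500

y_val = rng.integers(0, 2, n_val)        # small public leaderboard set
y_test = rng.integers(0, 2, n_test)      # large private test set

val_scores, test_scores = [], []
for _ in range(n_submissions):
    # each submission guesses labels at random, i.e. there is no real signal
    val_scores.append((rng.integers(0, 2, n_val) == y_val).mean())
    test_scores.append((rng.integers(0, 2, n_test) == y_test).mean())

best = int(np.argmax(val_scores))
print(f"best validation accuracy: {val_scores[best]:.3f}")   # well above 0.5
print(f"same submission on test:  {test_scores[best]:.3f}")  # about 0.5
```

The gap between the two numbers is exactly the optimistic bias introduced by selecting on the validation set.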
If you can't keep a statistically pure test set, then I'm afraid the two best options are (i) collect some new data to make a new statistically pure test set or (ii) make the caveat that the new model was based on a choice made after observing the test set error, so the performance estimate is likely to have an optimistic bias.
What you describe is, in its simplest form, model selection: you explore different ML models or hyper-parameter configurations and pick the best one according to your success criterion. Here you can also use cross-validation, which is more accurate and statistically more reliable than a single validation set. Beyond your question, note that when comparing classification algorithms, accuracy may not be a good choice, for the reasons listed here, especially (but not only) on imbalanced datasets.
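As a hedged sketch of what that looks like in practice (scikit-learn's GridSearchCV with balanced accuracy is just one possible choice of tool and metric; the data and grid are placeholders):

```python
# Sketch: hyper-parameter selection via cross-validation, scored with
# balanced accuracy rather than plain accuracy (illustrative data and grid).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# deliberately imbalanced synthetic data (90% / 10%)
X, y = make_classification(n_samples=400, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

grid = GridSearchCV(SVC(),
                    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                    scoring="balanced_accuracy", cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```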
Not all statistical procedures split the data into training/testing sets, a practice also called "cross-validation" (although the entire procedure involves a little more than that).
Rather, this is a technique that is specifically used to estimate out-of-sample error; i.e. how well will your model predict new outcomes using a new dataset? This becomes a very important issue when you have, for example, a very large number of predictors relative to the number of samples in your dataset. In such cases, it is really easy to build a model with great in-sample error but terrible out-of-sample error (called "over-fitting"). In the cases where you have both a large number of predictors and a large number of samples, cross-validation is a necessary tool to help assess how well the model will behave when predicting on new data. It's also an important tool when choosing between competing predictive models.
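To make the over-fitting point concrete, here is a small illustration (synthetic data with no real signal; the model and library are just one possible choice): in-sample accuracy can be essentially perfect while cross-validated accuracy sits at chance.

```python
# Sketch: with many predictors and few samples, in-sample accuracy can be
# near-perfect while cross-validated accuracy is near chance
# (labels are random, so there is nothing real to learn).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 500))          # 50 samples, 500 predictors
y = rng.integers(0, 2, 50)              # labels unrelated to X

model = LogisticRegression(C=100, max_iter=5000)
print("in-sample accuracy:", model.fit(X, y).score(X, y))                      # close to 1.0
print("cross-validated accuracy:", cross_val_score(model, X, y, cv=5).mean())  # about 0.5
```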
On another note, cross-validation is almost always just used when trying to build a predictive model. In general, it is not very helpful for models when you are trying to estimate the effect of some treatment. For example, if you are comparing the distribution of tensile strength between materials A and B ("treatment" being material type), cross validation will not be necessary; while we do hope that our estimate of treatment effect generalizes out of sample, for most problems classic statistical theory can answer this (i.e. "standard errors" of estimates) more precisely than cross-validation. Unfortunately, classical statistical methodology1 for standard errors doesn't hold up in the case of overfitting. Cross-validation often does much better in that case.
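For contrast, the classical route for that kind of treatment question might look like the following (made-up measurements; a two-sample t-test with scipy is just one standard option), with no cross-validation anywhere:

```python
# Sketch of the classical approach: compare tensile strength of materials
# A and B with a standard error and a two-sample t-test (made-up numbers).
import numpy as np
from scipy import stats

a = np.array([52.1, 49.8, 51.4, 50.9, 53.0, 48.7])   # material A
b = np.array([47.9, 46.5, 48.8, 47.2, 49.1, 46.0])   # material B

t, p = stats.ttest_ind(a, b, equal_var=False)
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
print(f"mean difference {a.mean() - b.mean():.2f}, SE {se:.2f}, t={t:.2f}, p={p:.3f}")
```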
On the other hand, if you are trying to predict when a material will break based on 10,000 measured variables that you throw into some machine learning model based on 100,000 observations, you'll have a lot of trouble building a great model without cross validation!
I'm guessing that in a lot of physics experiments you are generally interested in estimating effects. In those cases, there is very little need for cross-validation.
1One could argue that Bayesian methods with informative priors are a classical statistical methodology that addresses overfitting. But that's another discussion.
Side note: while cross-validation first appeared in the statistics literature, and is definitely used by people who call themselves statisticians, it has become a fundamental, required tool in the machine learning community. Lots of statistical models will work well without it, but almost all models that are considered "machine learning predictive models" need cross-validation, as they often require the selection of tuning parameters, which is almost impossible to do sensibly otherwise.