Model Evaluation – Comparing Machine Learning and Statistical Model Evaluation Techniques

cross-validation, hypothesis testing, machine learning, statistical significance

I'm a machine learning practitioner who spends most of my time building applications in Python, mainly machine vision with neural networks on remote sensing data, and evaluating model performance mostly with cross-validation-based techniques.

Statistics has (of course) been part of my learning path and I am familiar with the basic concepts, but I am not actively doing the work of a "pure" statistician, that is, using e.g. R or SAS to fit generalized linear models, compute $t$/$p$-values, run ANOVA, and test the statistical significance of model covariates.

Even though I've been exposed to both fields for many years now, it is still not clear to me how (traditional/"pure") statistics and machine learning differ from each other. I know the general consensus is that statistics cares more about explaining the data, whereas machine learning (ML) is interested in making predictions, even though the difference is sometimes vague and both fields use the same methods.

Now in ML, when we have fitted some model to the data, we usually care about how well the model has learned the underlying pattern in the data, and we measure this using an independent test data set. We never want a perfect fit of the model to the training data, because that would lead to overfitting and poor generalization. So the relevant question in ML (to my understanding) is never "how good is the fit of the model to the data?", but rather "how well can the model predict new situations, that is, generalize?"
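To make this concrete, here is a minimal sketch of the kind of evaluation I mean, using scikit-learn on a synthetic dataset (the data, model, and settings are just illustrative placeholders, not my actual remote sensing setup):

```python
# Minimal sketch: ML-style evaluation against held-out data (illustrative only).
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic data standing in for any supervised learning problem.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Hold out an independent test set that the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)

# k-fold cross-validation on the training data estimates generalization performance.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
print("5-fold CV R^2:", cv_scores.mean())

# Final check: fit on all the training data, then score once on the untouched test set.
model.fit(X_train, y_train)
print("Held-out test R^2:", model.score(X_test, y_test))
```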

When I listen to my statistician colleagues, I notice them talking mostly about $t$/$p$-values, $R^2$ values, $F$-statistics, differences in group means in ANOVA, and so on.

To me, from an ML perspective, my statistician colleagues seem to be concerned precisely with the "goodness of fit" of their models/covariates to the data, and there is no independent test data set that they use to validate their models. Of course, they don't do this because it is not the goal; the goal is to explain the data under the explicit assumptions we have made about it (normality etc., the usual ones).
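If I understand that workflow correctly, it looks roughly like the following sketch, in which a linear model is fitted with statsmodels and all the significance metrics are computed on the very same data the model was fitted to (the data here are made up purely for illustration):

```python
# Minimal sketch of the "goodness-of-fit" view: fit a linear model and read off
# in-sample significance metrics (made-up data, illustrative only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 2.0 * x1 + 0.5 * x2 + rng.normal(scale=1.0, size=200)

X = sm.add_constant(np.column_stack([x1, x2]))
result = sm.OLS(y, X).fit()

# summary() reports coefficient t-values and p-values, R^2 and the F-statistic,
# all computed on the same data the model was fitted to -- no held-out test set.
print(result.summary())
```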

Now, I may well be wrong due to my lack of knowledge of the subject, but it somehow seems odd to me that "pure" statisticians are so interested in these "goodness-of-fit" statistics, because don't the $p$-values, $R^2$ etc. in the end measure exactly that, i.e. the model fit (given that our distributional assumptions are correct)? For example, we all know that neural networks are universal approximators, meaning we can approximate any continuous function (on a bounded input domain) to arbitrary precision just by adding enough neurons to the model. Now, wouldn't a universal-approximating neural network, tuned to our data like hell, show huge statistical significance in the $p$-values or $R^2$ metrics if we look at the model fit from a "pure" statistician's perspective?

To summarize, would a statistician conclude "we have found something truly significant" in this neural network scenario? An ML scientist would produce an independent test sample, feed it to the network, and conclude that the model is overfitting the training data like hell and no pattern has been found. In other words, the ML scientist would conclude "nothing significant has been found".
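Here is a rough sketch of the scenario I have in mind, assuming a scikit-learn MLP trained on pure noise (all settings are an illustrative toy, not a real experiment): an over-flexible network memorizes the training points and looks like an excellent fit in-sample, while the held-out test set reveals that nothing was learned.

```python
# Hedged sketch: an over-parameterized network fitted to pure noise looks like a
# great in-sample fit, but the independent test set shows no pattern was learned.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=(60, 1))
y = rng.normal(size=60)  # pure noise: there is no pattern to find

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Far more parameters than training points, trained to convergence.
net = MLPRegressor(hidden_layer_sizes=(200, 200), alpha=0.0, solver="lbfgs",
                   max_iter=20000, tol=1e-7, random_state=0)
net.fit(X_train, y_train)

print("Training R^2:", net.score(X_train, y_train))  # typically near 1: the points are memorized
print("Test R^2:    ", net.score(X_test, y_test))    # typically near 0 or negative: nothing generalizes
```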

Maybe I phrased my question too vaguely, but here it is in summary: Is it true, to some extent, that statisticians are usually more concerned with a model's goodness of fit and the corresponding significance metrics, and not so much with the model's generalization capability, and vice versa for ML scientists?

Thank you in advance for any answers that help me clear up my confusion 🙂

Best Answer

Is it true, to some extent, that statisticians are usually more concerned with a model's goodness of fit and the corresponding significance metrics, and not so much with the model's generalization capability, and vice versa for ML scientists?

Scientists and analysts using "pure" statistics have recently gotten into some trouble precisely for focusing excessively on metrics of significance. In fact, in 2017, the American Statistical Association held a Symposium on Statistical Inference which to a large extent discussed the over-reliance that many statisticians have on statistical significance and p-values. It's been a fairly serious problem in the sciences recently.

But your question is difficult to answer, mostly because there are still no agreed-upon definitions that mark the difference between statistics, machine learning, and data science. A carpenter 2000 years ago used a hammer and nails. Now, a carpenter uses a power drill. When the power drill was invented, did the carpenters who started using it rename themselves? Did they also rename the "nail" a "multi-object concatenation device (MOCD)" and the "hammer" a "manual MOCD implementer"? Obviously not. That would be absurd. Yet machine learning practitioners and data scientists have more or less done this, and it makes statistics and machine learning seem more different than they really are.

High computing power, parallel processing, and new methods became available. Now suddenly a "variable" is a "feature!" You no longer recode your variables, you engage in feature engineering. "Pearson's phi" that was used 100 years ago, you say? No, no, no, it is now the MCC! The list goes on. Differences in jargon often do not reflect any real differences in the underlying mathematics or theory.
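As a small check of that point (with made-up labels, purely for illustration): the "modern" MCC is numerically the same quantity as the century-old phi coefficient, i.e. the Pearson correlation between two binary variables.

```python
# The Matthews correlation coefficient is just Pearson's phi under a newer name.
import numpy as np
from sklearn.metrics import matthews_corrcoef

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])  # made-up labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])  # made-up predictions

mcc = matthews_corrcoef(y_true, y_pred)
phi = np.corrcoef(y_true, y_pred)[0, 1]  # plain Pearson correlation of the 0/1 vectors

print(mcc, phi)  # both 0.6 for these labels
```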

Having said that, to the extent that jobs called "statistician" and "machine learning practitioner" are starting to diverge in focus, I think you'll find that they diverge in largely the ways you expect. "Statistician" jobs tend to be more scientifically conservative, focusing on understanding relationships and testing assumptions. "Machine learning" jobs tend to be in industry, where the goal is to deal quickly and efficiently with a lot of data in order to make decisions that yield a desired result, which is rarely scientific understanding or evidence for/against hypotheses.
