machine-learning – What Are Core Statistical Model Validation Techniques?

cross-validation, descriptive statistics, machine learning, statistical significance, validation

I am a self-taught machine learning and data science enthusiast. In the course of my self-study, I have come across various validation techniques, such as LOOCV, k-fold cross-validation, and the bootstrap, and I use them frequently. However, I came across an article that said core statisticians do not treat these methods as their go-to validation techniques.

Ever since, I have been wondering which techniques hard-core statisticians consider or use for model validation. I assume it would not just be hypothesis testing and p-values. Is there something fundamental I am missing about how core statisticians approach model validation?

Best Answer

In machine learning, the overall goal of modeling is to make accurate predictions. Cross-validation is a method for estimating the accuracy of a model's predictions on unobserved cases; when you optimize your model using CV, you're selecting a final model based on its ability to make predictions. For example, if you have a regression model, you don't care whether you have the "true" values of the parameters; you want the values that make the most accurate predictions. (For many machine learning models, like random forests or neural nets, it doesn't even make sense to ask what the "true" values of the parameters are.)
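As a rough illustration of the idea, here is a minimal sketch of k-fold cross-validation for an ordinary least-squares regression, written with NumPy only. The function name kfold_mse, the synthetic data, and the choice of mean squared error as the score are all assumptions made for this example, not anything prescribed by a particular library.

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Estimate the out-of-sample mean squared error of OLS by k-fold CV."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))      # shuffle before splitting
    folds = np.array_split(indices, k)     # k roughly equal folds
    fold_errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds only (intercept column added by hand).
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the held-out fold, which the fit never saw.
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        residuals = y[test_idx] - A_test @ beta
        fold_errors.append(np.mean(residuals ** 2))
    return float(np.mean(fold_errors))

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

cv_mse = kfold_mse(X, y, k=5)
print(f"5-fold CV estimate of MSE: {cv_mse:.3f}")
```

Note that nothing here asks whether the fitted coefficients are close to 2 and 3. The only quantity being estimated is how well the fitted model predicts cases it was not trained on.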

In what we might call "classical statistics," the overall goal of modeling is to draw reliable inferences about unobserved parameters describing a population of interest. For example, working with a regression model, the classical statistician wants reliable estimates of the true values of the regression coefficients. She doesn't care (so much) whether the model makes accurate predictions. Because the goal is reliable inference, many of the methods of classical statistics revolve around identifying and eliminating sources of bias.
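For contrast with a prediction score, here is a minimal sketch of what inference about regression coefficients might look like: point estimates, standard errors, and approximate 95% confidence intervals. The function name ols_inference and the synthetic data are assumptions for illustration. The intervals use the normal quantile 1.96 rather than the exact t quantile, which is only a reasonable approximation when the sample is large.

```python
import numpy as np

def ols_inference(X, y, z=1.96):
    """Return OLS estimates, standard errors, and approximate confidence intervals.

    z=1.96 gives a large-sample (normal approximation) 95% interval.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])       # design matrix with intercept
    p = A.shape[1]
    XtX_inv = np.linalg.inv(A.T @ A)
    beta = XtX_inv @ A.T @ y                   # least-squares estimates
    residuals = y - A @ beta
    sigma2 = residuals @ residuals / (n - p)   # unbiased residual variance
    se = np.sqrt(sigma2 * np.diag(XtX_inv))    # standard errors
    ci = np.column_stack([beta - z * se, beta + z * se])
    return beta, se, ci

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=0.5, size=500)

beta, se, ci = ols_inference(X, y)
for name, b, s, (lo, hi) in zip(["intercept", "slope"], beta, se, ci):
    print(f"{name}: estimate={b:.3f}, SE={s:.3f}, approx. 95% CI=({lo:.3f}, {hi:.3f})")
```

These intervals are only trustworthy if the model's assumptions hold and the sample is unbiased, which is why so much of the classical workflow is spent checking those conditions.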

So, for a classical statistician, model validation is much more involved than applying CV and selecting the model with the maximum accuracy/minimum error. The statistician will consider things like the following:

Are there potential sources of sampling bias in the way the data were generated?
Qualitatively, is the model a good model of how the data were generated?
Quantitatively, are the model's mathematical assumptions satisfied? For example, with the regression, are the covariates free of strong collinearity, and are the residuals Gaussian and homoscedastic?
Assuming hypothesis testing is being used, does the study design have enough power to detect the expected effect?
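Some of the quantitative checks in this list can be sketched in a few lines. The example below is a rough illustration, not a complete diagnostic workflow: it computes variance inflation factors to see how strongly the covariates are correlated with one another, summarizes the residuals by skewness and excess kurtosis (both near zero for Gaussian residuals), computes a Breusch-Pagan-style statistic for heteroscedasticity, and approximates the power of a two-sided two-sample z-test. All function names and the synthetic data are assumptions made for this sketch.

```python
import numpy as np
from statistics import NormalDist

def fit_ols(A, y):
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def r_squared(A, y):
    resid = y - A @ fit_ols(A, y)
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_j), where column j is regressed on the other columns."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        vifs.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return np.array(vifs)

def residual_diagnostics(X, y):
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    resid = y - A @ fit_ols(A, y)
    z = (resid - resid.mean()) / resid.std()
    return {
        "skewness": float(np.mean(z ** 3)),
        "excess_kurtosis": float(np.mean(z ** 4) - 3.0),
        # n * R^2 from regressing squared residuals on the covariates;
        # under homoscedasticity this is roughly chi-square distributed
        # with as many degrees of freedom as there are covariates.
        "bp_lm_stat": float(n * r_squared(A, resid ** 2)),
    }

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    effect_size is the standardized mean difference (Cohen's d).
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Synthetic data (an assumption for illustration): x2 is nearly a copy of x1.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 + x3 + rng.normal(scale=0.5, size=500)

vifs = variance_inflation_factors(X)
diag = residual_diagnostics(X, y)
power = two_sample_power(effect_size=0.5, n_per_group=64)
print("VIFs:", np.round(vifs, 1))
print("Residual diagnostics:", diag)
print(f"Approximate power: {power:.2f}")
```

Large VIFs for x1 and x2 would flag that their individual coefficients are poorly determined, even though the model's predictions may be perfectly good. Questions about sampling bias or whether the model is a sensible description of the data-generating process cannot be reduced to a computation like this and still require subject-matter judgment.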

From the machine learning perspective, classical statisticians worry too much about a lot of unimportant details; the only important thing is to get the predictions right. (Compare Breiman.) From the classical statistics perspective, machine learning methods are opportunistic and unreliable. ML advocates are willing to use questionable data, and there's no reason to think that their methods will lead us to any underlying truth.