Solved – Inference Statistics for Neural Networks

inference, machine learning

Linear regression has been around for a long time, and a well-developed set of statistics has grown up around it for assessing a fitted model, such as $p$-values, $R^2$, and the $F$-statistic, which help catch internally inconsistent or poorly performing models. These statistics have been generalized to logistic regression and to generalized linear models (GLMs) more broadly.

However, most machine learning packages leave these statistics out in favor of cross-validation (CV) and out-of-sample (OOS) backtesting; many do not support them at all. Why is that the case? Is there any software, or even any research, that supports this kind of model selection?
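(To make the contrast concrete, here is a minimal sketch of the two ecosystems, using simulated data; the data and variable names are invented for illustration. statsmodels reports the classical inference statistics out of the box, while scikit-learn exposes no equivalent summary and instead emphasizes cross-validated error.)

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Simulated data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=200)

# statsmodels: classical inference statistics out of the box.
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())  # coefficient t-tests, p-values, R^2, F-statistic, ...

# scikit-learn: no summary() of this kind; model assessment is done
# out-of-sample, e.g. via k-fold cross-validation.
cv_mse = -cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring="neg_mean_squared_error").mean()
print("5-fold CV mean squared error:", cv_mse)
```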

Best Answer

It seems to me that these two kinds of models are often used for two different purposes. I will be paraphrasing from the first few chapters of *An Introduction to Statistical Learning*.

On one hand, you have models for inference. These are often used to find support for or against theories. If Theory A argues that X predicts Y, then traditionally we want to see that a coefficient $\hat{\beta}$ is "significant" at $p < .05$, using test statistics like $t$ and $F$. These are less about model performance and more about theory. For example, we often use arbitrary 1–7 point scales and contrived laboratory settings. We do not really know how useful an effect size like $R^2$ might be, and we are not really interested in predicting the future. We also want results to be interpretable, so we purposely choose rigid parametric models that carry assumptions like linearity.
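As a reminder of what that test looks like, the usual $t$-statistic for a single coefficient in a linear model with $n$ observations and $p$ parameters is

$$ t = \frac{\hat{\beta}}{\widehat{\mathrm{SE}}(\hat{\beta})}, $$

which under $H_0\!: \beta = 0$ follows a $t_{n-p}$ distribution; we declare the coefficient "significant" when the resulting $p$-value falls below .05.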

On the other hand, you have models for prediction. These are often used more practically: to predict the future or to categorize unseen data. We don't necessarily care whether any one predictor is "significant"; we just want the overall model to be useful for those tasks. This work is less theory driven, so people use flexible, non-parametric models that are far harder to interpret theoretically. Performance is measured by how far off our predictions are, which is where things like cross-validation and the mean squared prediction error (MSPE) come into play.
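For instance, a 5-fold cross-validation estimate of the MSPE, $\frac{1}{n}\sum_i (\hat{y}_i - y_i)^2$ computed on held-out folds, might look like the following sketch; the data and the small network architecture are invented for illustration:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import KFold

# Simulated nonlinear data, invented for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

# 5-fold CV: train on 4 folds, measure squared error on the held-out fold.
fold_mspe = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=1).split(X):
    net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                       random_state=1)
    net.fit(X[train_idx], y[train_idx])
    resid = net.predict(X[test_idx]) - y[test_idx]
    fold_mspe.append(np.mean(resid ** 2))

print("CV estimate of MSPE:", np.mean(fold_mspe))
```

Note that nothing here produces a $p$-value; the only output is an estimate of out-of-sample error, which is exactly the point.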

In short, rigid models like linear regression, together with frequentist hypothesis testing via $p$-values and their related test statistics, are more about drawing a theoretical inference from the data (e.g., the more prejudiced someone is, the more they will discriminate against an outgroup member), so we care mainly that an effect is not zero (or, increasingly, is not trivially small). Neural networks are more about making predictions from the data (e.g., how accurately can we predict discrimination from self-reported prejudice?), so we check how far off our predictions were (e.g., using cross-validation and looking at the MSPE). Each type of statistic serves its own purpose.