Understanding Why Parameters Go Untested in Machine Learning Algorithms

machine learning, statistical significance

I finished up a machine learning (ML) course a while back. Everything was framed as an optimization problem. No matter what predictive challenge you face in ML, you're generally minimizing some objective (i.e., cost) function. In doing so, you come up with "optimal" parameters that satisfy the equation you're working with (e.g., using gradient descent for linear regression, where MSE is the objective function you minimize).
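
As a concrete illustration (a minimal sketch with simulated data and illustrative names, not part of the original question): gradient descent on the MSE cost for linear regression lands on the same coefficients as the closed-form least-squares solution, i.e., the same estimates a statistics package would report.

```python
# Minimal sketch: gradient descent on the MSE cost converges to the same
# coefficients as the closed-form least-squares solution.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
beta_true = np.array([2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta = np.zeros(2)
lr = 0.1
for _ in range(2000):
    grad = 2 / n * X.T @ (X @ beta - y)   # gradient of the MSE objective
    beta -= lr * grad

beta_closed_form = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta)               # gradient-descent estimates
print(beta_closed_form)   # closed-form OLS estimates (essentially identical)
```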

Is it the case with all machine learning models that, when you find the optimal parameters that minimize your objective function, you are also, almost by definition, finding the same statistically significant coefficients you would otherwise discover if you were to attack the problem from a stats perspective where the focus is on tests of statistical significance? Let's define "stats perspective" as running a model in R and adding or deleting variables based on their statistical significance or on how much they change the AIC.

Best Answer

Is it the case with all machine learning models that, when you find the optimal parameters that minimize your objective function, you are also, almost by definition, finding the same statistically significant coefficients you would otherwise discover if you were to attack the problem from a stats perspective where the focus is on tests of statistical significance?

In some cases, yes. If you're using logistic regression as a classifier, then minimizing the cross-entropy cost is the same as maximizing the log-likelihood, so the "optimal" parameters are exactly the maximum-likelihood estimates that significance tests are built on. Not every model is like this, though. Deep neural networks do not map nicely onto a statistical counterpart (though there is some active research on their statistical properties).
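
To make that concrete, here is a minimal sketch (simulated data and illustrative variable names, not from the question): minimizing the logistic cross-entropy cost numerically recovers the maximum-likelihood coefficients, and the very same fit yields the Wald standard errors and p-values a stats package would report.

```python
# Minimal sketch: the cross-entropy cost of logistic regression is the negative
# Bernoulli log-likelihood, so minimizing it gives the MLE, and classical
# inference (Wald tests) is computed from that same fit.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
true_beta = np.array([-0.5, 1.2])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

def neg_log_likelihood(beta, X, y):
    """Cross-entropy cost = negative Bernoulli log-likelihood."""
    z = X @ beta
    return np.sum(np.logaddexp(0, z) - y * z)  # log(1 + exp(z)) computed stably

res = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]), args=(X, y), method="BFGS")
beta_hat = res.x  # the "optimal parameters" from the optimization point of view

# The same estimates support classical inference: Wald standard errors come from
# the inverse observed information (Hessian of the negative log-likelihood).
p = 1 / (1 + np.exp(-X @ beta_hat))
W = p * (1 - p)
hessian = X.T @ (X * W[:, None])
se = np.sqrt(np.diag(np.linalg.inv(hessian)))
z_stats = beta_hat / se
p_values = 2 * norm.sf(np.abs(z_stats))

print("coef:", beta_hat)
print("se:  ", se)
print("p:   ", p_values)
```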

As to variable selection via AIC or similar, that would be a form of feature selection. I think that is something to ask about on its own.

As to your titular question: in ML, inference is not the main concern; predictive capability is. Besides, most ML problems (most, not all) work with data sets so large that significance becomes a foregone conclusion. The sheer size of the data allows for very precise estimates, and since no effect is exactly zero, you would find that every effect comes out "significant".
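
A quick simulated illustration of that last point (illustrative numbers, not from the answer): a slope of 0.01 is practically negligible, yet its p-value collapses toward zero once the sample gets large enough.

```python
# Minimal sketch: with a tiny but nonzero effect, the p-value shrinks toward
# zero as the sample size grows, so "significance" stops being informative
# at ML-scale data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
for n in [1_000, 100_000, 10_000_000]:
    x = rng.normal(size=n)
    y = 0.01 * x + rng.normal(size=n)   # slope of 0.01: practically negligible
    slope, intercept, r, p, se = stats.linregress(x, y)
    print(n, p)
```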
