I just ran into the "Akaike information criterion", and I noticed that there is a large literature on model selection (related criteria such as BIC also seem to exist).
Why don't contemporary machine learning methods take advantage of model selection criteria such as AIC and BIC?
Best Answer
AIC and BIC are used, e.g., in stepwise regression. They are part of a larger class of "heuristics" that see similar use; for example, the DIC (Deviance Information Criterion) is often used in Bayesian model selection.
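To make the stepwise use concrete, here is a minimal sketch of forward stepwise selection driven by AIC. It assumes Gaussian errors, so AIC reduces to $n\ln(\mathrm{RSS}/n) + 2k$ up to an additive constant; the data are synthetic and purely for illustration:

```python
import numpy as np

def aic(rss, n, k):
    # Gaussian-likelihood AIC, up to an additive constant: n*ln(RSS/n) + 2k
    return n * np.log(rss / n) + 2 * k

def forward_stepwise(X, y):
    # Greedily add the feature that most improves AIC; stop when none helps.
    n, p = X.shape
    selected, remaining = [], list(range(p))
    best_aic = aic(np.sum((y - y.mean()) ** 2), n, 1)  # intercept-only model
    while remaining:
        scores = []
        for j in remaining:
            cols = selected + [j]
            A = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            scores.append((aic(rss, n, len(cols) + 1), j))
        score, j = min(scores)
        if score >= best_aic:
            break
        best_aic, selected = score, selected + [j]
        remaining.remove(j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(size=200)  # only features 0 and 2 matter
print(sorted(forward_stepwise(X, y)))
```

The true features 0 and 2 are reliably picked up; because AIC under-penalizes, a spurious feature may occasionally slip in as well.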
However, they are basically "heuristics". While it can be shown that both AIC and BIC converge asymptotically towards cross-validation approaches (I think AIC corresponds to leave-one-out CV and BIC to some other approach, but I am not sure), they are known to under-penalize and over-penalize, respectively. That is, using AIC you will often get a model that is more complicated than it should be, whereas with BIC you often get a model that is too simplistic.
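For reference, the two criteria differ only in their complexity penalty. With $\hat{L}$ the maximized likelihood, $k$ the number of parameters, and $n$ the sample size,

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L},$$

so whenever $\ln n > 2$ (i.e. $n \geq 8$) BIC charges more per parameter than AIC, which matches its tendency to select simpler models.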
Since both are related to CV, CV is often the better choice, as it does not suffer from these problems.
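For comparison, leave-one-out CV needs no parameter count at all: it is estimated directly from held-out prediction error. A minimal NumPy sketch for least-squares regression, again on made-up data:

```python
import numpy as np

def loo_cv_mse(X, y):
    # Leave-one-out CV: for each point, fit on the rest and predict it.
    n = X.shape[0]
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        A = np.column_stack([np.ones(n - 1), X[mask]])
        beta, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        pred = np.concatenate([[1.0], X[i]]) @ beta
        errs.append((y[i] - pred) ** 2)
    return float(np.mean(errs))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=50)
print(loo_cv_mse(X, y))  # close to the noise variance, 0.01
```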
Then, finally, there is the issue of the number of parameters, which both AIC and BIC require. With general function approximators (e.g. KNNs) on real-valued inputs, it is possible to "hide" parameters, i.e. to construct one real number that contains the same information as two real numbers (think, e.g., of interleaving the digits). In that case, what is the actual number of parameters? On the other hand, in more complicated models you may have constraints on your parameters, say you can only fit parameters such that $\theta_1 > \theta_2$ (see e.g. here). Or you may have non-identifiability, in which case multiple values of the parameters actually give the same model. In all these cases, simply counting parameters does not give a suitable estimate.
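The digit trick above can be made explicit with a toy example: two "parameters", written as digit strings, are packed into one and recovered losslessly, so a model with one real-valued parameter can in principle encode two.

```python
def interleave(a: str, b: str) -> str:
    # Pack two equal-length digit strings into one by alternating digits,
    # e.g. "123" and "456" -> "142536": one number encoding two.
    return "".join(x + y for x, y in zip(a, b))

def deinterleave(c: str):
    # Undo the packing: even positions are the first number, odd the second.
    return c[0::2], c[1::2]

packed = interleave("3141", "2718")
print(packed)               # -> "32174118"
print(deinterleave(packed)) # -> ("3141", "2718")
```

Any criterion that charges a fixed price per parameter can thus be gamed by an expressive enough model class, which is exactly the problem for universal approximators.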
Since many contemporary machine-learning models show these properties (i.e. universal approximation, unclear number of parameters, non-identifiability), AIC and BIC are less useful for them than they may seem at first glance.
EDIT:
Some more points that could be clarified: