Solved – Why isn’t the Akaike information criterion used more in machine learning

Tags: aic, bic, machine learning, model selection

I just ran into the "Akaike information criterion", and I noticed the large body of literature on model selection (things like the BIC also seem to exist).

Why don't contemporary machine learning methods make use of model selection criteria such as AIC and BIC?

Best Answer

AIC and BIC are used, e.g. in stepwise regression. They are part of a larger class of "heuristics" that are also in use; for example, the DIC (Deviance Information Criterion) is often used in Bayesian model selection.
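As an illustration, here is a minimal sketch of forward stepwise selection driven by AIC (the data and variable names are made up for this example; `statsmodels` exposes a fitted OLS model's AIC as `.aic`):

```python
import numpy as np
import statsmodels.api as sm

# Toy data (made up): y depends on columns 0 and 1 only; columns 2 and 3 are noise.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

def forward_stepwise_aic(X, y):
    """Greedily add the predictor that lowers AIC the most; stop when none does."""
    remaining = set(range(X.shape[1]))
    selected = []
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic  # intercept-only model
    while remaining:
        aic, j = min((sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().aic, j)
                     for j in remaining)
        if aic >= best_aic:
            break
        best_aic = aic
        selected.append(j)
        remaining.remove(j)
    return selected, best_aic

print(forward_stepwise_aic(X, y))  # typically selects columns 0 and 1
```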

However, they are basically "heuristics". While it can be shown that both AIC and BIC converge asymptotically towards cross-validation approaches (AIC is asymptotically equivalent to leave-one-out CV (Stone, 1977), and BIC to a leave-$v$-out scheme with $v/n \rightarrow 1$ (Shao, 1993)), they are known to under-penalize and over-penalize respectively. That is, with AIC you will often get a model that is more complicated than it should be, whereas with BIC you often get a model that is too simplistic.
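To make the penalties explicit: for a model with $k$ parameters, maximized likelihood $\hat{L}$, and $n$ observations,

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln(n) - 2\ln\hat{L}.$$

The AIC penalty per parameter is the constant $2$, while the BIC penalty $\ln(n)$ grows with the sample size (and exceeds $2$ once $n > e^2 \approx 7.4$), which is where the under-/over-penalization pattern comes from.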

Since both are related to CV, CV is often the better choice, as it does not suffer from these problems.
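For a concrete side-by-side comparison, here is a minimal numpy-only sketch that scores polynomial fits of increasing degree with AIC, BIC, and $K$-fold CV (the data, degrees, and noise level are all made up; it assumes Gaussian noise, so $-2\ln\hat{L} = n(\ln(2\pi\hat\sigma^2)+1)$ with the MLE $\hat\sigma^2 = \mathrm{RSS}/n$):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + 0.3 * rng.normal(size=n)  # true function is not a polynomial

def cv_mse(x, y, d, folds=5):
    """Plain K-fold cross-validated mean squared error for a degree-d fit."""
    idx = rng.permutation(len(x))
    mses = []
    for fold in np.array_split(idx, folds):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], d)
        mses.append(np.mean((np.polyval(coef, x[fold]) - y[fold]) ** 2))
    return np.mean(mses)

for d in range(1, 10):
    k = d + 2                                       # d+1 coefficients plus the noise variance
    rss = np.sum((np.polyval(np.polyfit(x, y, d), x) - y) ** 2)
    m2ll = n * (np.log(2 * np.pi * rss / n) + 1)    # -2 ln L under Gaussian noise
    print(f"d={d}  AIC={2 * k + m2ll:7.1f}  "
          f"BIC={k * np.log(n) + m2ll:7.1f}  CV-MSE={cv_mse(x, y, d):.3f}")
```

Note that CV estimates out-of-sample error directly, rather than through an asymptotic penalty term, which is why it sidesteps the under-/over-penalization issue.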

Finally, there is the issue of the number of parameters required by BIC and AIC. With general function approximators (e.g. KNNs) on real-valued inputs, it is possible to "hide" parameters, i.e. to construct a real number that contains the same information as two real numbers (think, e.g., of interleaving their digits). In that case, what is the actual number of parameters? On the other hand, in more complicated models you may have constraints on your parameters, say you can only fit parameters such that $\theta_1 > \theta_2$ (see e.g. here). Or you may have non-identifiability, in which case multiple parameter values actually give the same model. In all these cases, simply counting parameters does not give a suitable estimate.
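To make the parameter-hiding point concrete, here is a finite-precision illustration of the digit-interleaving idea (a toy, not a true bijection on the reals; see point 1 of the edit below):

```python
def pack(a, b, digits=6):
    """Interleave the decimal digits of two numbers in [0, 1) into one.

    Finite-precision toy: each input contributes `digits` digits, so the
    packed value carries the information of both in a single float.
    """
    da = f"{a:.{digits}f}"[2:]  # digit string after "0."
    db = f"{b:.{digits}f}"[2:]
    return float("0." + "".join(p + q for p, q in zip(da, db)))

def unpack(c, digits=6):
    dc = f"{c:.{2 * digits}f}"[2:]
    return float("0." + dc[0::2]), float("0." + dc[1::2])

c = pack(0.123456, 0.654321)
print(c)          # 0.162534435261  <- one number...
print(unpack(c))  # (0.123456, 0.654321)  <- ...recovers both
```

If this packed value is counted as a single parameter, an AIC or BIC computed with $k = 1$ clearly understates the model's capacity.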

Since many contemporary machine-learning algorithms exhibit these properties (i.e. universal approximation, an unclear number of parameters, non-identifiability), AIC and BIC are less useful for these models than they may seem at first glance.

EDIT:

Some more points that could be clarified:

  1. It seems I was wrong to consider the mapping by interleaving digits a bijection from $\mathbb{R}$ to $\mathbb{R}^N$ (see here): it fails because some reals have two decimal expansions (e.g. $0.0999\ldots = 0.1$), so depending on which expansion is chosen the interleaved map misses some values. However, we don't actually need a bijection for this idea to work (a surjection is enough).
  2. According to a proof by Cantor (1877), a bijection between $\mathbb{R}$ and $\mathbb{R}^N$ does exist. Although such a bijection is awkward to write down explicitly, its existence can be proven (e.g. via the Cantor–Schröder–Bernstein theorem, which holds without the axiom of choice, though the proof is non-constructive). This bijection can still be used in a theoretical model (it may not be possible to actually implement it in a computer) to unpack a single parameter into an arbitrary number of parameters.
  3. We don't actually need the mapping from $\mathbb{R}$ to $\mathbb{R}^N$ to be a bijection. Any surjective function $\mathbb{R}\rightarrow\mathbb{R}^N$ is enough to unpack multiple parameters from a single one. Such surjections can be shown to exist as limits of sequences of other functions (so-called space-filling curves, e.g. the Peano curve); see the sketch after this list.
  4. Because neither Cantor's proof is constructive (it proves the existence of the bijection without giving an explicit example), nor are the space-filling curves (they exist only as limits of sequences of approximating curves), the argument I made is only theoretical. In theory, we could keep packing more effective parameters into a single counted parameter and thereby push the BIC (on the training set) below any desired value. However, in an actual implementation we have to approximate the space-filling curve at finite precision, so the approximation error may prohibit us from actually doing so (I have not tested this).
  5. Strictly speaking, the axiom of choice is not needed here, but the existence proofs are non-constructive. That means that in constructive mathematics this argument may not go through, but I don't know what role constructive math plays for statistics.
  6. Identifiability is intrinsically linked to functional complexity. If one takes an identifiable $N$-parameter model and adds a superfluous parameter (e.g. one not used anywhere), then the new model becomes non-identifiable. Essentially, one is using a model with the complexity of $\mathbb{R}^{N+1}$ to solve a problem with the complexity of $\mathbb{R}^N$. The same holds for other forms of non-identifiability; take, for example, non-identifiable parameter permutations. In that case, one is using a model with the complexity of $\mathbb{R}^N$, while the actual problem only has the complexity of a set of equivalence classes over $\mathbb{R}^N$. This is only an informal argument, though; I don't know of any formal treatment of this notion of "complexity".
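Here is the sketch referenced in point 3: a finite-order approximation of a space-filling (Hilbert) curve, mapping one parameter $t \in [0,1]$ to two coordinates. The true space-filling curve is the limit as `order` $\rightarrow \infty$; any finite `order` only reaches a $2^{\text{order}} \times 2^{\text{order}}$ grid, which is exactly the approximation-error caveat from point 4. The bit-twiddling construction is the standard one; the function names are mine:

```python
def hilbert_d2xy(order, d):
    """Map index d along the Hilbert curve of the given order to (x, y)
    on a 2**order x 2**order grid (standard bit-twiddling construction)."""
    x = y = 0
    t = d
    s = 1
    while s < 2 ** order:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:               # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def unpack2(t, order=16):
    """Approximate surjection [0, 1] -> [0, 1]^2 via an order-th Hilbert curve."""
    n = 2 ** order
    d = min(int(t * n * n), n * n - 1)
    x, y = hilbert_d2xy(order, d)
    return x / (n - 1), y / (n - 1)

print(unpack2(0.3))  # one "parameter" t yields two coordinates
```

Each extra `order` doubles the grid resolution, so the "unpacked" parameters are only recovered up to a discretization error, in line with point 4.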