Solved – Why isn’t the Akaike information criterion used more in machine learning

Tags: aic, bic, machine learning, model selection

I just ran into the "Akaike information criterion", and I noticed the large body of literature on model selection (things like the BIC also seem to exist).

Why don't contemporary machine learning methods make use of model selection criteria such as AIC and BIC?

Best Answer

AIC and BIC are used, e.g. in stepwise regression. They are part of a larger class of "heuristics" that are also in use; for example, the DIC (Deviance Information Criterion) is often used in Bayesian model selection.
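As an illustration, here is a minimal sketch of forward stepwise selection driven by AIC (the data and variable names are made up for this example; `statsmodels` exposes a fitted OLS model's AIC as `.aic`):

```python
import numpy as np
import statsmodels.api as sm

# Toy data (made up): y depends on columns 0 and 1 only; columns 2 and 3 are noise.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

def forward_stepwise_aic(X, y):
    """Greedily add the predictor that lowers AIC the most; stop when none does."""
    remaining = set(range(X.shape[1]))
    selected = []
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic  # intercept-only model
    while remaining:
        aic, j = min((sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().aic, j)
                     for j in remaining)
        if aic >= best_aic:
            break
        best_aic = aic
        selected.append(j)
        remaining.remove(j)
    return selected, best_aic

print(forward_stepwise_aic(X, y))  # typically selects columns 0 and 1
```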

However, they are basically "heuristics". While it can be shown that both AIC and BIC converge asymptotically towards cross-validation approaches (AIC is asymptotically equivalent to leave-one-out CV (Stone, 1977), and BIC to a leave-$v$-out scheme with $v/n \rightarrow 1$ (Shao, 1993)), they are known to under-penalize and over-penalize respectively. That is, with AIC you will often get a model that is more complicated than it should be, whereas with BIC you often get a model that is too simplistic.
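To make the penalties explicit: for a model with $k$ parameters, maximized likelihood $\hat{L}$, and $n$ observations,

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln(n) - 2\ln\hat{L}.$$

The AIC penalty per parameter is the constant $2$, while the BIC penalty $\ln(n)$ grows with the sample size (and exceeds $2$ once $n > e^2 \approx 7.4$), which is where the under-/over-penalization pattern comes from.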

Since both are related to CV, CV is often the better choice, as it does not suffer from these problems.
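For a concrete side-by-side comparison, here is a minimal numpy-only sketch that scores polynomial fits of increasing degree with AIC, BIC, and $K$-fold CV (the data, degrees, and noise level are all made up; it assumes Gaussian noise, so $-2\ln\hat{L} = n(\ln(2\pi\hat\sigma^2)+1)$ with the MLE $\hat\sigma^2 = \mathrm{RSS}/n$):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + 0.3 * rng.normal(size=n)  # true function is not a polynomial

def cv_mse(x, y, d, folds=5):
    """Plain K-fold cross-validated mean squared error for a degree-d fit."""
    idx = rng.permutation(len(x))
    mses = []
    for fold in np.array_split(idx, folds):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], d)
        mses.append(np.mean((np.polyval(coef, x[fold]) - y[fold]) ** 2))
    return np.mean(mses)

for d in range(1, 10):
    k = d + 2                                       # d+1 coefficients plus the noise variance
    rss = np.sum((np.polyval(np.polyfit(x, y, d), x) - y) ** 2)
    m2ll = n * (np.log(2 * np.pi * rss / n) + 1)    # -2 ln L under Gaussian noise
    print(f"d={d}  AIC={2 * k + m2ll:7.1f}  "
          f"BIC={k * np.log(n) + m2ll:7.1f}  CV-MSE={cv_mse(x, y, d):.3f}")
```

Note that CV estimates out-of-sample error directly, rather than through an asymptotic penalty term, which is why it sidesteps the under-/over-penalization issue.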

Finally, there is the issue of the number of parameters required by BIC and AIC. With general function approximators (e.g. KNNs) on real-valued inputs, it is possible to "hide" parameters, i.e. to construct a real number that contains the same information as two real numbers (think, e.g., of interleaving their digits). In that case, what is the actual number of parameters? On the other hand, in more complicated models you may have constraints on your parameters, say you can only fit parameters such that $\theta_1 > \theta_2$ (see e.g. here). Or you may have non-identifiability, in which case multiple parameter values actually give the same model. In all these cases, simply counting parameters does not give a suitable estimate.
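To make the parameter-hiding point concrete, here is a finite-precision illustration of the digit-interleaving idea (a toy, not a true bijection on the reals; see point 1 of the edit below):

```python
def pack(a, b, digits=6):
    """Interleave the decimal digits of two numbers in [0, 1) into one.

    Finite-precision toy: each input contributes `digits` digits, so the
    packed value carries the information of both in a single float.
    """
    da = f"{a:.{digits}f}"[2:]  # digit string after "0."
    db = f"{b:.{digits}f}"[2:]
    return float("0." + "".join(p + q for p, q in zip(da, db)))

def unpack(c, digits=6):
    dc = f"{c:.{2 * digits}f}"[2:]
    return float("0." + dc[0::2]), float("0." + dc[1::2])

c = pack(0.123456, 0.654321)
print(c)          # 0.162534435261  <- one number...
print(unpack(c))  # (0.123456, 0.654321)  <- ...recovers both
```

If this packed value is counted as a single parameter, an AIC or BIC computed with $k = 1$ clearly understates the model's capacity.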

Since many contemporary machine-learning algorithms exhibit these properties (i.e. universal approximation, an unclear number of parameters, non-identifiability), AIC and BIC are less useful for these models than they may seem at first glance.

EDIT:

Some more points that could be clarified:

  1. It seems I was wrong to consider the mapping by interleaving digits a bijection from $\mathbb{R}$ to $\mathbb{R}^N$ (see here): it fails because some reals have two decimal expansions (e.g. $0.0999\ldots = 0.1$), so depending on which expansion is chosen the interleaved map misses some values. However, we don't actually need a bijection for this idea to work (a surjection is enough).
  2. According to a proof by Cantor (1877), a bijection between $\mathbb{R}$ and $\mathbb{R}^N$ does exist. Although such a bijection is awkward to write down explicitly, its existence can be proven (e.g. via the Cantor–Schröder–Bernstein theorem, which holds without the axiom of choice, though the proof is non-constructive). This bijection can still be used in a theoretical model (it may not be possible to actually implement it in a computer) to unpack a single parameter into an arbitrary number of parameters.
  3. We don't actually need the mapping from $\mathbb{R}$ to $\mathbb{R}^N$ to be a bijection. Any surjective function $\mathbb{R}\rightarrow\mathbb{R}^N$ is enough to unpack multiple parameters from a single one. Such surjections can be shown to exist as limits of sequences of other functions (so-called space-filling curves, e.g. the Peano curve); see the sketch after this list.
  4. Because neither Cantor's proof is constructive (it proves the existence of the bijection without giving an explicit example), nor are the space-filling curves (they exist only as limits of sequences of approximating curves), the argument I made is only theoretical. In theory, we could keep packing more effective parameters into a single counted parameter and thereby push the BIC (on the training set) below any desired value. However, in an actual implementation we have to approximate the space-filling curve at finite precision, so the approximation error may prohibit us from actually doing so (I have not tested this).
  5. Strictly speaking, the axiom of choice is not needed here, but the existence proofs are non-constructive. That means that in constructive mathematics this argument may not go through, but I don't know what role constructive math plays for statistics.
  6. Identifiability is intrinsically linked to functional complexity. If one takes an identifiable $N$-parameter model and adds a superfluous parameter (e.g. one not used anywhere), then the new model becomes non-identifiable. Essentially, one is using a model with the complexity of $\mathbb{R}^{N+1}$ to solve a problem with the complexity of $\mathbb{R}^N$. The same holds for other forms of non-identifiability; take, for example, non-identifiable parameter permutations. In that case, one is using a model with the complexity of $\mathbb{R}^N$, while the actual problem only has the complexity of a set of equivalence classes over $\mathbb{R}^N$. This is only an informal argument, though; I don't know of any formal treatment of this notion of "complexity".
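Here is the sketch referenced in point 3: a finite-order approximation of a space-filling (Hilbert) curve, mapping one parameter $t \in [0,1]$ to two coordinates. The true space-filling curve is the limit as `order` $\rightarrow \infty$; any finite `order` only reaches a $2^{\text{order}} \times 2^{\text{order}}$ grid, which is exactly the approximation-error caveat from point 4. The bit-twiddling construction is the standard one; the function names are mine:

```python
def hilbert_d2xy(order, d):
    """Map index d along the Hilbert curve of the given order to (x, y)
    on a 2**order x 2**order grid (standard bit-twiddling construction)."""
    x = y = 0
    t = d
    s = 1
    while s < 2 ** order:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:               # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def unpack2(t, order=16):
    """Approximate surjection [0, 1] -> [0, 1]^2 via an order-th Hilbert curve."""
    n = 2 ** order
    d = min(int(t * n * n), n * n - 1)
    x, y = hilbert_d2xy(order, d)
    return x / (n - 1), y / (n - 1)

print(unpack2(0.3))  # one "parameter" t yields two coordinates
```

Each extra `order` doubles the grid resolution, so the "unpacked" parameters are only recovered up to a discretization error, in line with point 4.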