Solved – Corrected AIC (AICC) for k-means

aick-meansmodel selection

I want to calculate the $AIC_c$ (corrected $AIC$) for k-means to decide on the number of clusters, but there is an overfitting problem that I don't know how to solve. Let's say that I have $n$ data points of $d$ dimensions each, and I want to cluster those $n$ points into $c$ clusters. The Akaike Information Criterion ($AIC$) is $-2ln(L)+ 2k$ where $k$ is the number of free parameters, and for k-means, it is $c(d+1)$. And for $AIC_c$, the formula becomes $-2ln(L)+ 2kn/(n-k-1)$.

Now, let's say $n=1000$, $d=200$, $c=10$, then the number of parameters is $k=2010$. Then the denominator in the penalty term of $AIC_c$ becomes negative, which means a negative penalty in the formula. And when $c$ increases, the penalty decreases, resulting in $AIC_c$'s supporting the models with the maximum number of clusters (when $c=n$).

I think I am missing or I have misunderstood a point about $AIC_c$. What is that? Thanks in advance.

Best Answer

The form of $AICc$ of

$$ AICc = AIC + \frac{2k(k+1)}{n-k-1} $$

was proposed by

Hurvich, C. M.; Tsai, C.-L. (1989), "Regression and time series model selection in small samples", Biometrika 76: 297–307

specifically for a linear regression model with normally distributed errors. For different models, a different correction will need to be derived.

These derivations are often difficult and the resulting correction may be challenging to calculate. For instance

Hurvich, Clifford M., Jeffrey S. Simonoff, and Chih‐Ling Tsai. "Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60, no. 2 (1998): 271-293.

propose a correction to be used in the case of nonparametric regression models which takes the form

$$ AICc = -2ln(L) + n^2\int_0^1(1-t)^{r/2-2}\prod_{j=1}^{r}(1-t+2d_j)^{-1/2}dt+n\int_0^{\infty}\sum_{i=1}^n\frac{c_{ii}}{1+2d_it}\prod_{i=1}^n(1+2d_it)^{-1/2}dt $$

I will not go into the details here as they are largely irrelevant but I wanted to illustrate the complexity involved. Actual calculation of this value involves eigen-analysis and numerical integration.

For reasons like this, many authors such as

Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2nd ed.), Springer-Verlag, ISBN 0-387-95364-7

suggest to use the form

$$ AICc = AIC + \frac{2k(k+1)}{n-k-1} $$

regardless of model. Even Hurvich et al. (1998) despite deriving their complicated $AICc$ for nonparametric regression ultimately conclude that you might as well use the much simpler version for linear regression.

Generally this advice seems to work well, giving practically useful results. However there are circumstances, such as the one you've highlighted where it doesn't work. You would need to find an appropriate $AICc$ for k-means, or derive one yourself, or simply use $AIC$ which is more generally applicable.