Solved – Akaike Information Criterion for k-means

Tags: aic, bic, k-means, likelihood

I am trying to use the AIC and BIC for selecting the number of clusters for k-means. I found a post on Stackoverflow where some R code is posted, but I could not find any valid explanation of why it is correct.

The code:

kmeansAIC = function(fit){
  m = ncol(fit$centers)   # dimensionality of the data points
  n = length(fit$cluster) # number of data points
  k = nrow(fit$centers)   # number of clusters
  D = fit$tot.withinss    # total within-cluster sum of squares
  return(D + 2*m*k)       # D stands in for -2*logLik; m*k free parameters
}

It basically states that

$$
AIC = -2 \mathcal{L}(\hat{\theta}) + 2p =
\left(\sum_{l=1}^{k}\sum_{i=1}^{n_l}(\vec{x}_{i,l}-\vec{\mu}_l)^2\right) + 2 d k
$$
where $k$ is the number of clusters, $n_l$ is the number of points in cluster $l$, $\vec{\mu}_l$ is the center of cluster $l$, $d$ is the dimensionality of each point, and $p = dk$ is the number of free parameters ($k$ cluster centers with $d$ coordinates each).

Now my problem is with the derivation of the log-likelihood function.
If I consider k-means as a form of an EM algorithm with spherical clusters, I get a Gaussian cluster $cl_i$ with density $f_{cl_i}$ given by

$$
f_{cl_i}(\vec{x}; \vec{\mu}_i, \mathbf{\Sigma})
= \frac{1}{\sqrt{(2 \pi)^d \det(\mathbf{\Sigma})}}
\exp\left(-\frac{1}{2} (\vec{x}-\vec{\mu}_i)^T \mathbf{\Sigma}^{-1} (\vec{x}-\vec{\mu}_i)\right)
$$
with $\mathbf{\Sigma}$ as the covariance matrix of cluster $cl_i$.

As the cluster is spherical, $\mathbf{\Sigma}$ can be written as $\mathbf{\Sigma} = \sigma^2 \mathbf{I}$, where $\mathbf{I}$ is the identity matrix, so that $\det(\mathbf{\Sigma}) = \sigma^{2d}$ and $\mathbf{\Sigma}^{-1} = \frac{1}{\sigma^2}\mathbf{I}$.
This gives $f_{cl_i}$ as
$$
f_{cl_i}(\vec{x}; \vec{\mu}_i, \sigma^2)
= \frac{1}{\sqrt{(2 \pi \sigma^2)^d}}
\exp\left(-\frac{(\vec{x}-\vec{\mu}_i)^2}{2 \sigma^2}\right)
$$
and therefore the log-likelihood function $\mathcal{L}$ for a single cluster of $n$ points is
$$
\begin{aligned}
\mathcal{L}(\Theta;\vec{x}) &= \sum_{i=1}^{n} \log\left(
\frac{1}{\sqrt{(2 \pi \sigma^2)^d}}
\exp\left(-\frac{(\vec{x}_i-\vec{\mu})^2}{2 \sigma^2}\right)
\right) \\
&= - n \log\left( \sqrt{(2 \pi \sigma^2)^d} \right)
- \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (\vec{x}_i-\vec{\mu})^2
\end{aligned}
$$

I understand that the first term can be omitted, as I am only interested in differences of the AIC between different $k$, and that term is constant in $k$ (for a fixed $\sigma^2$).
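Making the intermediate step explicit: summing the per-cluster log-likelihoods over all $k$ clusters and multiplying by $-2$ gives
$$
-2 \mathcal{L}(\hat{\Theta}) = \frac{1}{\sigma^2} \sum_{l=1}^{k}\sum_{i=1}^{n_l}(\vec{x}_{i,l}-\vec{\mu}_l)^2 + \underbrace{n d \log(2 \pi \sigma^2)}_{\text{constant in } k}
$$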

This leads me to the following formula for the AIC:
$$
AIC = \frac{1}{\sigma^2} \left(\sum_{l=1}^{k}\sum_{i=1}^{n_l}(\vec{x}_{i,l}-\vec{\mu}_l)^2\right) + 2 k d
$$

which differs by the factor $\frac{1}{\sigma^2}$ from the formula on Stackoverflow.

I know that $\sigma^2$ does not matter for k-means itself, but it is obvious that a larger $\sigma^2$ increases the relative weight of the complexity penalty in the AIC. Thus it seems strange to me to just assume $\sigma^2 = 1$.
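For illustration, if one instead treats $\sigma^2$ as a free parameter and plugs in its maximum-likelihood estimate $\hat{\sigma}^2 = D/(nd)$, the first term of the log-likelihood no longer cancels and the scale dependence disappears. A minimal sketch of that variant (my own construction, not from the Stackoverflow post; valid up to an additive constant in $k$):

kmeansAICmle = function(fit){
  d = ncol(fit$centers)    # dimensionality of the data points
  n = length(fit$cluster)  # number of data points
  k = nrow(fit$centers)    # number of clusters
  D = fit$tot.withinss     # total within-cluster sum of squares
  sigma2 = D / (n * d)     # ML estimate of the common spherical variance
  # -2*logLik = n*d*log(sigma2) + constant; free parameters:
  # d*k center coordinates plus one variance
  return(n * d * log(sigma2) + 2 * (d * k + 1))
}

Rescaling the data then only shifts all AIC values by the same constant, so the selected $k$ would not depend on the measurement scale.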

Is there an error in my line of argumentation or am I missing something?

I also used the same concept to calculate the Bayesian Information Criterion (as sketched below) and observed that I end up with a suspiciously large number of clusters if I select the minimum of the BIC. The BIC drops very fast in the beginning and then decreases very slowly.
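For concreteness, the BIC analogue replaces the AIC penalty $2p$ with $p \log(n)$, so under the same $\sigma^2 = 1$ assumption as the Stackoverflow code it would look like this (a sketch):

kmeansBIC = function(fit){
  m = ncol(fit$centers)    # dimensionality of the data points
  n = length(fit$cluster)  # number of data points
  k = nrow(fit$centers)    # number of clusters
  D = fit$tot.withinss     # total within-cluster sum of squares
  return(D + log(n) * m * k)
}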

Is it common to select an AIC/BIC value where the scoring function no longer shows big decreases (similar to the elbow technique), or should one always pick the minimum of the function?

Best Answer

I think that the answer you refer to on Stackoverflow is in turn based on another link: http://sherrytowers.com/2013/10/24/k-means-clustering/.

In that post the data used is standardized, so the variance is equal to 1 and the formula makes sense. In any case I think your line of reasoning is absolutely correct, and one should refrain from using that expression if the underlying assumption does not hold.
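As a minimal usage sketch of that convention (my illustration; scale and kmeans are base R, kmeansAIC is the function from the question, and the iris columns are just a stand-in dataset): standardize each column to unit variance first so the implicit $\sigma^2 = 1$ assumption holds, then compare AIC values across $k$.

set.seed(1)
X = scale(iris[, 1:4])  # zero mean, unit variance per column
aic = sapply(1:10, function(k) kmeansAIC(kmeans(X, centers = k, nstart = 25)))
which.min(aic)          # candidate number of clusters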
