Solved – X-mean algorithm BIC calculation question

bick-means

I'm having trouble understanding some of the formulas in this paper related to BIC calculation (Dan Pelleg and Andrew Moore, X-means: Extending K-means with Efficient Estimation of the Number of Clusters).

First the variance equation:

R – number of points
K – number of clusters
$\mu_i$ – centroid associated with ith point.
$\sigma^2 = \frac{1}{R-K}\sum_{i}(x_i – \mu_{(i)})^2 $

The log likelyhood then uses this sigma.
Am I reading this right, they're using 1 covariance matrix for all clusters (see quote below, they are)? This makes no sense. If you have 5 clusters, each one is a Gaussian according to k-means algorithm. So wouldn't it make sense to compute covariance $\sigma^2_i$ for each cluster and use that?

My second question is regarding number of parameters to use in the BIC score. The paper mentions

Number of free parameters $p_j$ is simply the sum of K-1 class
probabilities, M*K centroid coordinates, and one variance estimate.

How do you get the K-1 class probabilities? I could do # of points in class i / total number of points. But then it's K-1, which probability is left out of the sum?

P.S. If anyone has a nicer paper on estimating k using similar methods I'd like to read that as well. At this point I'm not too concerned with speed.

Thanks for your help.

Best Answer

Let the clusters be indexed by $j = 1, \ldots, K$ with $K_j \gt 0$ points in cluster $j$. Let $\mu_j$ (no parentheses around the subscript) designate the mean of cluster $j$. Then, because by definition $\mu_{(i)}$ is the mean of whichever cluster $x_i$ belongs to, we can group the terms in the summation by cluster:

$$\eqalign{ \sigma^2 &= \frac{1}{R-K}\sum_{i}(x_i - \mu_{(i)})^2 \\ &= \frac{1}{R-K}\sum_{j=1}^K\sum_{k=1}^{K_j}(x_k - \mu_j)^2 \\ &= \frac{1}{R-K}\sum_{j=1}^K K_j \frac{1}{K_j}\sum_{k=1}^{K_j}(x_k - \mu_j)^2 \\ &= \frac{1}{R-K}\sum_{j=1}^K K_j \sigma_j^2 },$$

with $\sigma_j^2$ being the variance within cluster $j$ (where we must use $K_j$ instead of $K_j-1$ in the denominators to handle singleton clusters). I believe this is what you were expecting.

Related Solutions

Using BIC – Estimating Number of Clusters in K-Means with Python

It seems you have a few errors in your formulas, as determined by comparing to:

np.sum([n[i] * np.log(n[i]) -
               n[i] * np.log(N) -
             ((n[i] * d) / 2) * np.log(2*np.pi) -
              (n[i] / 2) * np.log(cl_var[i]) -
             ((n[i] - m) / 2) for i in range(m)]) - const_term

Here there are three errors in the paper, fourth and fifth lines are missing a factor of d, the last line substitute m for 1. It should be:

np.sum([n[i] * np.log(n[i]) -
               n[i] * np.log(N) -
             ((n[i] * d) / 2) * np.log(2*np.pi*cl_var) -
             ((n[i] - 1) * d/ 2) for i in range(m)]) - const_term

The const_term:

const_term = 0.5 * m * np.log(N)

should be:

const_term = 0.5 * m * np.log(N) * (d+1)

The variance formula:

cl_var = [(1.0 / (n[i] - m)) * sum(distance.cdist(p[np.where(label_ == i)], [centers[0][i]], 'euclidean')**2)  for i in range(m)]

should be a scalar:

cl_var = (1.0 / (N - m) / d) * sum([sum(distance.cdist(p[np.where(labels == i)], [centers[0][i]], 'euclidean')**2) for i in range(m)])

Use natural logs, instead of your base10 logs.

Finally, and most importantly, the BIC you are computing has an inverse sign from the regular definition. so you are looking to maximize instead of minimize

Solved – Probability for selecting centroids – K-means++

The paper (see section 2.2) suggests that you use the squared distance when computing probabilities. In fact, you can try distance$^{\ell}$ using any exponent $\ell$ greater than 1. It stands to reason that as $\ell$ increases, the likelihood of initial centroids being close together will go to zero.

Best Answer

Related Solutions

Using BIC – Estimating Number of Clusters in K-Means with Python

Solved – Probability for selecting centroids – K-means++

Related Question