Solved – Understanding k-means unsupervised learning for features

k-means, machine learning, neural networks, unsupervised learning

I'm following this paper:

http://ai.stanford.edu/~ang/papers/icdar01-TextRecognitionUnsupervisedFeatureLearning.pdf

And I'm trying to understand specifically how the k-means approach works for learning feature filters in a convolutional feature map. The section I'm reading is III.A on page 2. From what I understand, we start out with a large set of normalized/whitened patches, say 3600 patches taken from various samples of the training data, and the key is to find a reduced set of filters, maybe 96 filters, that minimize the Euclidean distance to the 3600 original patches.

I understand that we are solving for a dictionary $D$ that contains all the learned filters, each at a small Euclidean distance from the points in its respective cluster. But I don't understand why we multiply $D$ by $s^{(i)}$ in the following equation (1) of the paper:

$$\sum_i \left\| D s^{(i)} - x^{(i)} \right\|^2$$

If $s^{(i)}$ is a basis vector that belongs to $D$, then what purpose does $D$ serve? I'm actually really confused about how the k-means process is supposed to work in the particular setting this paper describes. What do they mean by a "one-hot encoder," and if we're summing over the difference of $s^{(i)}$ and $x^{(i)}$, why would $s$ and $x$ have the same index? Wouldn't we want to see which cluster $x$ belongs to in order to minimize our squared error, i.e. minimize $\|s^{(j)} - x^{(i)}\|^2$ over $j$, and assign $x$ to the proper $s$?

Also, they don't explain the minimization process. I understand how minimization works in normal k-means clustering, but not in this context. How do you alternate between minimizing over $D$ and minimizing over $s$? Lastly, how do we initialize $s^{(i)}$? Do we just take it from a column of $D$?

Best Answer

If I understand the paper correctly, then this is essentially a generalization of the K-means algorithm, with generalized centroids. The technical term is "spherical K-means"; see, for example, this paper.

When you're doing K-means, you're partitioning your data into $K$ clusters $S=\{S_1,S_2,\cdots,S_K\}$ with centroids $\mu_i$ that minimize the total squared distance within each cluster. To be clear, once you've chosen a cluster $S_i$, that fixes $\mu_i$ to be the average of all observations in $S_i$.

When your observations are vectors, your centroids can be represented as vectors as well. Summarizing, for K-means we are minimizing over all possible clusterings:

$$\min_{S}\sum_{i=1}^K\sum_{x\in S_i}\|x-\mu_i\|^2.$$
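To make the baseline concrete, here is a minimal numpy sketch of standard $K$-means (Lloyd's iterations) for exactly this objective. The function and variable names are my own; it is only here to anchor the notation, not code from the paper.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means (Lloyd's algorithm): alternate between assigning each
    point (row of X) to its nearest centroid and recomputing each centroid
    as the mean of the points assigned to it."""
    rng = np.random.default_rng(seed)
    # initialize centroids with K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # assignment step: index of the closest centroid for every point
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each centroid becomes the mean of its cluster
        for k in range(K):
            if np.any(labels == k):
                mu[k] = X[labels == k].mean(axis=0)
    return mu, labels
```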

Here each $x$ is a vector as is each $\mu_i$. Write $\mu_i=\|\mu_i\|D_i$, where $D_i$ is a unit vector. If we form a matrix $D$ of centroids, then we can gain access to $\mu_i$ by $\mu_i=\|\mu_i\| D\delta_i$, where $\delta_i$ is the indicator vector on index $i$: $(\delta_i)_j=0$ for $j\neq i$ and 1 otherwise. In the paper, the job of $s^{(i)}$ is to select the centroid $\mu_k$ for observation $x^{(i)}$. There is one key difference here: the $\mu_i$'s in this optimization are no longer fixed to be the means of each cluster. Furthermore, you are allowed to dilate the centroids to minimize the error for each data point in that class. So instead of $K$-means, we're really doing "$K$-centroids with dilation." So the minimization above now also minimizes over the $\mu_i$.
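To show how the alternation the question asks about might look, here is a numpy sketch of this "$K$-centroids with dilation" reading of equation (1): an $s$-step that picks a one-hot code with a gain for every point, and a $D$-step that refits each unit-norm column. The initialization, damping, and exact update schedule in the paper may differ; every name below is my own.

```python
import numpy as np

def gain_shape_kmeans(X, K, n_iter=100, seed=0):
    """Sketch of 'K-centroids with dilation': minimize sum_i ||D s_i - x_i||^2
    with unit-norm columns of D and each code s_i having at most one nonzero
    entry (the gain).  X holds one data point per row."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    # initialize D with K random unit-norm columns (one column per centroid)
    D = rng.normal(size=(dim, K))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_iter):
        # s-step: for fixed D, the best one-hot code for x selects the column
        # with the largest |projection| and uses that projection as the gain,
        # because ||g*D_k - x||^2 = ||x||^2 - (D_k . x)^2 at the best gain
        # g = D_k . x.
        proj = X @ D                        # (n, K) projections onto columns
        assign = np.abs(proj).argmax(axis=1)
        gains = proj[np.arange(len(X)), assign]
        # D-step: for fixed codes, the best unit-norm column is the normalized
        # gain-weighted sum of the points that selected it.
        for k in range(K):
            mask = assign == k
            if mask.any():
                v = (gains[mask, None] * X[mask]).sum(axis=0)
                norm = np.linalg.norm(v)
                if norm > 0:
                    D[:, k] = v / norm
    return D, assign, gains
```

This also answers the initialization question as I read it: you don't initialize $s^{(i)}$ at all; you initialize $D$ (e.g. randomly, with unit-norm columns) and the first $s$-step then produces the codes.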

So when they say "one-hot encoder," they really just mean selecting one particular centroid $\mu_i$ for each observation, although it sounds like their goal is to emphasize the direction rather than the magnitude of $\mu_i$.
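Concretely, in my own notation (and assuming the gain is chosen to minimize the squared error for the selected column): if $x^{(i)}$ selects column $k$, then $s^{(i)}$ is zero in every entry except the $k$-th, which holds the gain, so

$$s^{(i)} = \big(D_k^\top x^{(i)}\big)\,\delta_k, \qquad D s^{(i)} = \big(D_k^\top x^{(i)}\big)\,D_k,$$

and $D s^{(i)}$ is exactly the dilated centroid that $x^{(i)}$ gets compared to in equation (1). That is also why $s$ and $x$ share the index $i$: there is one code vector per observation, not one per cluster.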

Why $K$-means and spherical $K$-means are different: take three points in $\mathbb{R}^2$ that form an equilateral triangle and let $K=2$. A moment's consideration will reveal that $K$-means will solve this by picking any pair of points and assigning them to class 1, with their (mean) centroid on the edge of the triangle, and taking the remaining point and assigning it to class 2, with its (mean) centroid equal to the point itself. On the other hand, spherical $K$-means can do better. To see why, notice that if you were to assign the third point as usual with a centroid on top of it, you could reverse this centroid vector by multiplying by $-1$, and now it is on top of the $K$-means (mean) centroid between the first two points. Then you could take a second centroid vector and just point it at one of the first two points. So your error should be half that of $K$-means in this case.
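Here is a quick numerical check of that construction, assuming the triangle is centered at the origin (which the multiply-by-$-1$ argument implicitly requires); the coordinates and helper function are my own.

```python
import numpy as np

# Equilateral triangle with unit side length, centered at the origin so that
# negating a centroid direction makes geometric sense.
A = np.array([-0.5, -1 / (2 * np.sqrt(3))])
B = np.array([ 0.5, -1 / (2 * np.sqrt(3))])
C = np.array([ 0.0,  1 / np.sqrt(3)])

# Plain K-means with K=2: cluster {A, B} with its mean, cluster {C} with C.
mu_AB = (A + B) / 2
kmeans_error = np.sum((A - mu_AB) ** 2) + np.sum((B - mu_AB) ** 2)

def line_error(x, d):
    """Squared distance from x to the line through the origin along d,
    i.e. the minimum over gains g of ||g*d - x||^2."""
    d = d / np.linalg.norm(d)
    return np.sum(x ** 2) - (d @ x) ** 2

# Spherical K-means with dilation, using the construction above:
# direction 1 points at C (C and B use it), direction 2 points at A.
d1, d2 = C / np.linalg.norm(C), A / np.linalg.norm(A)
spherical_error = line_error(C, d1) + line_error(B, d1) + line_error(A, d2)

print(kmeans_error)     # 0.5
print(spherical_error)  # 0.25
```

The printed errors come out to 0.5 for plain $K$-means and 0.25 for the spherical construction, matching the claimed factor of two.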