Solved – cluster analysis, Ward: how to evaluate number of clusters and their quality

clustering

I have a table of similarities (cosines) and I clustered it with the Ward method. Great outcomes, a wonderful dendogram, but then I tried to evaluate the quality of this cluster solution and I got stuck.

First: identifying the number of clusters in my data (cause in Ward is not like k-means where you have to set a precise number of clusters). I calculated the sum of squares (see attachment) to see how many clusters are there, but there isn't a proper "elbow" in the data, so how many clusters shall I consider?

Second: trying to calculate the purity of the clustering (with the tool CluTo), by indicating 4, 5, 6, 7… clusters, I can see that the purity increases the more clusters I indicate. Of course. If the number of clusters equals the number of instances of my data, then purity is 1 (the maximum). dah.

Any suggestion on how to report this? (number of clusters? quality of the clustering solution?)

enter image description here

Best Answer

If you are using R, this is a really nice page (http://www.statmethods.net/advstats/cluster.html) that steps though a few different methods to help in identifying the optimal number of clusters. HTH.

Related Solutions

Solved – Distance between two Gaussian mixtures to evaluate cluster solutions

Suppose we have two Gaussian mixtures in $\mathbb R^d$:$\DeclareMathOperator{\N}{\mathcal N} \newcommand{\ud}{\mathrm{d}} \DeclareMathOperator{\E}{\mathbb E} \DeclareMathOperator{\MMD}{\mathrm{MMD}}$ $$ P = \sum_{i=1}^{n} \alpha_i P_i = \sum_{i=1}^n \alpha_i \N(\mu_i, \Sigma_i) \qquad Q = \sum_{j=1}^m \beta_j Q_j = \sum_{j=1}^m \N(m_j, S_j) .$$ Call their densities $p(\cdot)$ and $q(\cdot)$, respectively, and denote the densities of their components $P_i$, $Q_j$ by $p_i(x) = \N(x; \mu_i, \Sigma_i)$, $q_j(x) = \N(x; m_j, S_j)$.

The following distances are available in closed form:

$L_2$ distance, as suggested in a comment by user39665. This is: \begin{align} L_2(P, Q)^2 &= \int (p(x) - q(x))^2 \,\ud x \\&= \int \left( \sum_{i} \alpha_i p_i(x) - \sum_j \beta_j q_j(x) \right)^2 \ud x \\&= \sum_{i,i'} \alpha_i \alpha_{i'} \int p_i(x) p_{i'}(x) \ud x + \sum_{j,j'} \beta_j \beta_{j'} \int q_j(x) q_{j'}(x) \ud x \\&\qquad - 2 \sum_{i,j} \alpha_i \beta_j \int p_i(x) q_j(x) \ud x .\end{align} Note that, as seen for example in section 8.1.8 of the matrix cookbook: \begin{align} \int \N(x; \mu, \Sigma) \N(x; \mu', \Sigma') \,\ud x &= \N(\mu; \mu', \Sigma + \Sigma') \end{align} so this can be evaluated easily in $O(m n)$ time.
The maximum mean discrepancy (MMD) with a Gaussian RBF kernel. This is a cool distance, not yet super-well-known among the statistics community, that takes a bit of math to define.

Letting $$k(x, y) := \exp\left( - \frac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right),$$ define the Hilbert space $\mathcal{H}$ as the reproducing kernel Hilbert space corresponding to $k$: $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal H}$.

Define the mean map kernel as $$ K(P, Q) = \E_{X \sim P, Y \sim Q} k(X, Y) = \langle \E_{X \sim P} \varphi(X), \E_{Y \sim Q} \varphi(Y) \rangle .$$

The MMD is then \begin{align} \MMD(P, Q) &= \lVert \E_{X \sim P}[\varphi(X)] - \E_{Y \sim Q}[\varphi(Y)] \rVert \\&= \sqrt{K(P, P) + K(Q, Q) - 2 K(P, Q)} \\&= \sup_{f : \lVert f \rVert_{\mathcal H} \le 1} \E_{X \sim P} f(X) - \E_{Y \sim Q} f(Y) .\end{align}

For our mixtures $P$ and $Q$, note that $$ K(P, Q) = \sum_{i, j} \alpha_i \beta_j K(P_i, Q_j) $$ and similarly for $K(P, P)$ and $K(Q, Q)$.

It turns out, using similar tricks as for $L_2$, that $K(\N(\mu, \Sigma), \N(\mu', \Sigma'))$ is $$ (2 \pi \sigma^2)^{d/2} \N(\mu; \mu', \Sigma + \Sigma' + \sigma^2 I) .$$

As $\sigma \to 0$, clearly this converges to a multiple of the $L_2$ distance. You'd normally want to use a different $\sigma$, though, one on the scale of the data variation.

Closed forms are also available for polynomial kernels $k$ in the MMD; see

Muandet, Fukumizu, Dinuzzo, and Schölkopf (2012). Learning from Distributions via Support Measure Machines. In Advances in Neural Information Processing Systems (official version). arXiv:1202.6504.

For a lot of nice properties of this distance, see

Sriperumbudur, Gretton, Fukumizu, Schölkopf, and Lanckriet (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517–1561. arXiv:0907.5309.
Quadratic Jensen-Rényi divergence. The Rényi-$\alpha$ entropy is defined as $$ H_\alpha(p) = \frac{1}{1-\alpha} \log\left( \int p(x)^\alpha \,\ud x \right) .$$ Its limit as $\alpha \to 1$ is the Shannon entropy. The Jensen-Rényi divergence is $$ \mathrm{JR}_\alpha(p, q) = H_\alpha\left( \frac{p + q}{2} \right) - \frac{H_\alpha(p) + H_\alpha(q)}{2} $$ where $\frac{p + q}{2}$ denotes an equal mixture between $p$ and $q$. It turns out that, when $\alpha = 2$ and when $P$ and $Q$ are Gaussian mixtures (as here), you can compute a closed form for $\mathrm{JR}_2$. This was done by

Wang, Syeda-Mahmood, Vemuri, Beymer, and Rangarajan (2009). Closed-Form Jensen-Renyi Divergence for Mixture of Gaussians and Applications to Group-Wise Shape Registration. Med Image Comput Comput Assist Interv., 12(1), 648–655. (free pubmed version)

Solved – Segmentation using cluster analysis in SPSS

In my opinion, the quality of your cluster solution is inherently subjective. There's no right or wrong but rather "makes sense and is actionable for us" or "doesn't make sense, is not actionable".

Now first, what you should know is that all three clustering algorithms are affected by the order of your cases. This really raises questions about the soundness of the procedure altogether. But given that there's randomness in the result anyway, you may perhaps just as well capitalize on it as long as you're transparent about what you did and your reasoning behind it.

You can run the same clustering syntax repeatedly (say 10 times) and randomly reorder cases between runs. You can automate this with Python, an example that comes close to this is Regression over Many Dependent Variables. Now you'll get 10 different solutions, compare their quality and choose what you find best.

My personal opinion is that two pieces of information matter:

1 How are the frequencies of cases over clusters? You usually don't want very small or very large clusters.
2 Can you describe what the clusters mean? For this you may run means plots of the input variables with the cluster variable.

if these results make sense (they're explicable) and they're actionable, I'd say you have a good solution.

Best Answer

Related Solutions

Solved – Distance between two Gaussian mixtures to evaluate cluster solutions

Solved – Segmentation using cluster analysis in SPSS

Related Question