Solved – Methods for comparing clustering results

clustering, r, unsupervised learning, weka

I am doing an unsupervised clustering analysis for a genomics project. This means that I do not know whether a particular clustering analysis is good or not.

I am running different clustering algorithms on different 'sets of features'. What I mean by different 'sets of features' is that, given a data frame, I choose different combinations of columns depending on their biological importance. For instance, some variables measure things at the sequence level, while others measure a particular cellular process or some other feature that cannot be measured at the sequence level. I am playing around with the outputs for these sets of features, running the algorithms with all the features, or ignoring some, etc.

What I want is to compare the different clusters of these different runs and see if some of my objects are being clustered similarly despite lacking some sets of features. Does this make sense?

Is there any recommendation on how I can do this?

Best Answer

You can use the Adjusted Rand Index or the Adjusted Mutual Information to measure the similarity (agreement) of the overall results of two clustering algorithms on an overlapping dataset.

Both scores are adjusted for chance, which means that two random clusterings will likely have an ARI or AMI close to 0.0.
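As a minimal sketch of how two runs could be compared, the snippet below uses scikit-learn's `adjusted_rand_score` and `adjusted_mutual_info_score`. The label vectors are made-up placeholders standing in for the cluster assignments of the same objects under two different feature sets; they are not data from the question.

```python
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

# Hypothetical cluster labels for the same 8 objects from two runs,
# e.g. one run with all features and one with sequence-level features only.
labels_all_features = [0, 0, 0, 1, 1, 1, 2, 2]
labels_sequence_only = [1, 1, 1, 0, 0, 2, 2, 2]

ari = adjusted_rand_score(labels_all_features, labels_sequence_only)
ami = adjusted_mutual_info_score(labels_all_features, labels_sequence_only)
print(f"ARI={ari:.3f}  AMI={ami:.3f}")

# The scores ignore how clusters are named: a pure relabelling of the
# same partition is treated as perfect agreement.
relabelled = [2, 2, 2, 0, 0, 0, 1, 1]
ari_same = adjusted_rand_score(labels_all_features, relabelled)
print(f"ARI for a relabelled copy: {ari_same:.1f}")
```

Because both scores only look at which objects end up together, the two runs may use different algorithms, different feature sets, or even different numbers of clusters, as long as the labels refer to the same objects in the same order.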

Furthermore, you can use those measures for model selection (e.g. choosing k in k-means) by running the clustering algorithm twice on two overlapping samples of the dataset and measuring the agreement on the overlap. The assumption is that high agreement on the overlap means higher stability of the algorithm and hence a better value for k (one that better captures the real structure of the dataset).
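The overlap idea can be sketched roughly as follows. This is an illustration under assumptions, not the procedure from any particular paper: the helper `overlap_stability`, the 80% subsample fraction, the number of rounds, and the synthetic blobs are all made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def overlap_stability(X, k, n_rounds=5, frac=0.8, seed=0):
    """Mean ARI between two k-means runs, measured on the shared points.

    Hypothetical helper: each round draws two random subsamples, fits
    k-means on each, and compares the labels of the points they share.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    size = int(frac * n)
    scores = []
    for _ in range(n_rounds):
        idx_a = rng.choice(n, size=size, replace=False)
        idx_b = rng.choice(n, size=size, replace=False)
        common = np.intersect1d(idx_a, idx_b)
        km_a = KMeans(n_clusters=k, n_init=10,
                      random_state=int(rng.integers(1 << 31))).fit(X[idx_a])
        km_b = KMeans(n_clusters=k, n_init=10,
                      random_state=int(rng.integers(1 << 31))).fit(X[idx_b])
        # For k-means, predict() assigns each shared point to its nearest
        # centroid, which gives its label under each of the two fits.
        scores.append(adjusted_rand_score(km_a.predict(X[common]),
                                          km_b.predict(X[common])))
    return float(np.mean(scores))


# Synthetic data with three well-separated groups (an assumption for the demo).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=1.0, random_state=0)
stability = {k: overlap_stability(X, k) for k in range(2, 6)}
for k, score in stability.items():
    print(k, round(score, 3))
```

Values of k whose mean agreement stays close to 1 across rounds are the stable candidates; for an algorithm without a `predict` method, the labels of the shared points would have to be looked up by index instead.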

"A Novel Approach for Automatic Number of Clusters Detection in Microarray Data based on Consensus Clustering" by Nguyen Xuan Vinh and Julien Epps is probably the best reference for this method, which the paper applies to microarray data.

Related Solutions

Solved – Feature selection for clustering problems

I have a few thoughts to share about dimension reduction in unsupervised learning problems. In answering, I've assumed that your interest is in "high-touch," human involvement wrt cluster interpretation as opposed to an automated, turnkey, black box and "low-touch" machine learning approach in which interpretation is deliberately de-emphasized. If it were the latter, why would you even be asking the question? Also, note that I've had a ton of experience running cluster solutions across a wide range of business environments over the years including strategic B2C marketing, B2B tech arenas and education policy (clustering students and schools).

First though, I do have a question about your comment concerning "grouping different datasets." I didn't know what you meant by that or how it might impact the approach and was hoping you could elaborate.

I would like to challenge your assumption in #1 above that solutions based on PCAs are "hard to interpret." The reasons for even running a PCA as a preliminary step in clustering have mostly to do with the hygiene of the resulting solution insofar as many clustering algorithms are sensitive to feature redundancy. PCA collapses this redundancy into a manageable handful of components, thereby minimizing the challenges and difficulties that you note regarding feature selection. While it is true that the components output from a PCA blur the granularity and specificity of the individual features, this is a problem iff you solely rely on those components in analyzing the results. In other words, you are not in any way locked into using only the components for cluster interpretation. Not only that, you don't necessarily even need to care what the factor dimensions "mean." They are only an intermediate and (ultimately) disposable means to an end facilitating an actionable solution. But in making this point I differ from many practitioners since teams can, will and do spend weeks carefully building a "meaningful" factor solution. To me, this is an inefficient waste of client time and money.

At this point there will be a boatload of technical considerations to address. For one, if your PCA algorithm is not scale invariant (e.g., is OLS vs ML), then any resulting PCA solution will be distorted, loading more heavily on the high variance features. In these cases your features need to be preprocessed or transformed in some way to flatten this variance out. There are a huge number of possibilities here including mean standardizing, range or IQR standardizing, ipsative scaling, and so on. Use whichever transformation delivers the best, most interpretable solution.
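To make the scale issue concrete, here is a small sketch using scikit-learn. The two-column data set is invented for the demonstration, and `StandardScaler`, `MinMaxScaler` and `RobustScaler` are used as stand-ins for mean, range and IQR standardizing respectively (ipsative scaling is not shown).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: two unrelated features on very different scales.
X = np.column_stack([
    rng.normal(0.0, 1.0, size=500),     # low-variance feature
    rng.normal(0.0, 1000.0, size=500),  # high-variance feature
])

# Share of variance captured by the first component on the raw data:
# the high-variance feature swamps everything else.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_[0]
print(f"raw: {raw_ratio:.3f}")

scalers = {
    "mean/std": StandardScaler(),
    "range": MinMaxScaler(),
    "IQR": RobustScaler(),
}
scaled_ratio = {}
for name, scaler in scalers.items():
    X_scaled = scaler.fit_transform(X)
    ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_[0]
    scaled_ratio[name] = ratio
    print(f"{name}: {ratio:.3f}")
```

On the raw data the first component is essentially the high-variance column; after any of the three rescalings the two unrelated features contribute roughly equally.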

Once a cluster solution is generated, interpretation is best motivated (in my experience) by ignoring the components and folding back in the original features along with any additional descriptive information not directly used in the solution. At this point a few heuristics are the best guides to qualitative insight. This can be as easy as generating a spreadsheet that profiles your clusters based on averages or medians for each feature (the rows of the sheet), for each cluster (the columns) as well as an additional column representing the grand mean for your total sample. Then, by indexing the cluster averages for each feature against the grand mean (and multiplying by 100), a heuristic is created that is like an IQ score insofar as around "100" is "normal" IQ or average behavior, indexes of 120+ are suggestive of high likelihoods for a feature to be "true" about the behavior of a cluster and indexes of 80 or less are indicative of features that are "not true" of a cluster. These indexes of 120+ and 80 or less are like proxy t-tests for significance of a given feature in driving the solution. Of course, you can run between group tests of significance and, depending on sample sizes, will get answers that vary around these quick and dirty rules of thumb.
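The profiling spreadsheet described above could be built along these lines with pandas. The data frame, feature names and cluster labels are hypothetical; the 120 and 80 cut-offs are the rules of thumb from the text.

```python
import pandas as pd

# Hypothetical data: original features plus the cluster assigned to each row.
df = pd.DataFrame({
    "cluster": ["A", "A", "B", "B", "B", "C"],
    "gc_content": [0.40, 0.44, 0.60, 0.62, 0.58, 0.50],
    "expression": [10.0, 12.0, 30.0, 28.0, 32.0, 8.0],
})
features = ["gc_content", "expression"]

# Rows are features, columns are clusters.
cluster_means = df.groupby("cluster")[features].mean().T
grand_mean = df[features].mean()

# Index each cluster average against the grand mean (100 = average).
index = cluster_means.div(grand_mean, axis=0) * 100

profile = index.round(0)
profile["grand_mean"] = grand_mean
print(profile)

# Flag features that look "true" (120+) or "not true" (80 or less) of a cluster.
print(index >= 120)
print(index <= 80)
```

A ratio to the grand mean only behaves sensibly for features that are positive and not centred near zero, so standardized or signed features would need a different heuristic, such as the difference from the grand mean in standard deviation units.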

Ok...after all of that, suppose you're still opposed to using PCA as direct input into a clustering algorithm. The problem remains of how to select a reduced set of features. PCA can still be useful here since PCAs are like running a regression without a dependent variable. The top loading features on each component can become the inputs into the cluster algorithm.
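One way to pick the top loading feature on each component is sketched below. The synthetic data, the feature names and the choice of a single feature per component are assumptions made for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
latent_seq = rng.normal(size=n)   # hypothetical "sequence-level" signal
latent_cell = rng.normal(size=n)  # hypothetical "cellular process" signal

names = ["seq_1", "seq_2", "seq_3", "cell_1", "cell_2", "noise"]
X = np.column_stack([
    latent_seq + 0.1 * rng.normal(size=n),
    latent_seq + 0.1 * rng.normal(size=n),
    latent_seq + 0.1 * rng.normal(size=n),
    latent_cell + 0.1 * rng.normal(size=n),
    latent_cell + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])

# Standardize first so that no feature dominates through its scale alone.
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# For each component, keep the feature with the largest absolute loading.
top_features = []
for component in pca.components_:
    top_features.append(names[int(np.argmax(np.abs(component)))])
print(top_features)

# The reduced feature matrix that would be passed to the clustering algorithm.
X_reduced = X[:, [names.index(f) for f in top_features]]
```

Each redundant group of features is then represented by one original, directly interpretable column rather than by a component score.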

To your point about the large number of features and relatively small sample size of your data, the typical rule of thumb in many "full information" multivariate analyses is a minimum of about 10 observations per feature. There are some specialized methods that can be leveraged to work around this challenge. For instance, partial least squares (PLS), pioneered by Herman Wold (see his book Theoretical Empiricism), is used in fields such as chemometrics which face this precise issue. It is factor-analytic in nature but is much less stringent in requiring a large n to generate the dimensions. Other solutions include the random forest-like, "divide and conquer," machine learning approaches used with massive amounts of information. These methods are reviewed in this pdf http://www.wisdom.weizmann.ac.il/~harel/papers/Divide%20and%20Conquer.pdf

But suppose you've decided that you still want nothing to do with factor analysis and are dead set on running some kind of supervised, "sequential" selection process. In my view, the most important issue is less about finding a post-hoc performance metric (Dunn Index) and more about identifying a suitable proxy -- a dependent variable -- to even make this approach possible. This decision is entirely a function of your judgement and SME status wrt your data. There are no "best practices," much less easy answers for this and given how you've described your data, no small challenge.

Once that decision is made, then there are literally hundreds of possible variable selection solutions to choose from. Variable selection is a topic area on which every statistician and their brother has published a paper. Your preferred approach, "sequential forward selection," is fine.

It's worth noting that supervised learning models exist which fold in a cluster solution as part of the algorithm. Examples of this include the large and highly flexible approaches known as latent class models. The essence of LC models is that they are two stage: in stage one a DV is defined and a regression model is built. In the second stage, any heterogeneity in the residual output from the model -- a single latent vector -- is partitioned into latent "classes." There's an overview of LC modeling in the CV discussion "Latent class multinomial logit model doubt".

Hope this helps.

Solved – What statistical test for cluster analysis results should I use

Let me know if I am understanding your question correctly. Your data do not have labels so you perform Gaussian clustering. And you want to perform hypothesis testing to check, using these clusters as "labels", if your data differ significantly?

It seems like you want to treat these clusters as different levels of a single "factor" (in ANOVA speak). If the usual assumptions (multivariate normality and equal covariance across clusters) hold, you can proceed with a MANOVA (where the response is the 35-dimensional feature vector of your data points). But since these assumptions are violated, you cannot do the traditional MANOVA.

If I'm understanding this correctly, you can perform a permutation-based MANOVA. Anderson (2001) describes this approach. Essentially, it computes an F-ratio from sums of squares based on a distance (or any other dissimilarity measure) between your data points, generates a null distribution of F-ratios by repeatedly permuting the cluster labels, and compares the observed F-ratio to that distribution to obtain a p-value.
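For readers not using MATLAB, the permutation test can be sketched in a few lines of NumPy/SciPy. This is a simplified one-factor illustration of the pseudo-F statistic built from pairwise distances, not a replacement for a tested implementation; the function names `pseudo_f` and `permanova` and the simulated 35-dimensional data are assumptions for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def pseudo_f(D2, labels):
    """Pseudo-F ratio from a matrix of squared pairwise distances."""
    n = len(labels)
    groups = np.unique(labels)
    a = len(groups)
    ss_total = D2[np.triu_indices(n, k=1)].sum() / n
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = D2[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(X, labels, metric="euclidean", n_perm=999, seed=0):
    """Observed pseudo-F and permutation p-value for one grouping factor."""
    labels = np.asarray(labels)
    D2 = squareform(pdist(X, metric=metric)) ** 2
    f_obs = pseudo_f(D2, labels)
    rng = np.random.default_rng(seed)
    # Null distribution: recompute F after shuffling the group labels.
    f_perm = np.array([pseudo_f(D2, rng.permutation(labels))
                       for _ in range(n_perm)])
    p_value = (np.sum(f_perm >= f_obs) + 1) / (n_perm + 1)
    return f_obs, p_value


# Simulated example: two groups of 20 points in 35 dimensions,
# with the second group shifted by 1 in every coordinate.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 35)),
               rng.normal(1.0, 1.0, size=(20, 35))])
labels = np.array([0] * 20 + [1] * 20)
f_obs, p_value = permanova(X, labels)
print(f"pseudo-F={f_obs:.2f}, p={p_value:.3f}")
```

With Euclidean distances the statistic coincides with the usual ANOVA F-ratio summed over features; swapping the `metric` argument for another dissimilarity supported by `pdist` changes only the distance matrix, not the permutation logic.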

If you are using MATLAB, there is an implementation in the Fathom Toolbox.
