R – Clustering Techniques for Large and Sparse Datasets in Bioinformatics

bioinformaticsclusteringr

I have a few datasets of "interactions" between pairs of elements like so:

element1 element2 1
element2 element3 1
element4 element5 1
...
element505535 element4 2

where the value in the 3rd column is the "strength" of interaction. Almost all of these strengths are "1." A strength of 1 means that this interaction was observed one time. A strength of 2 means 2x, etc. I have actually gone one step further and normalized all of my datasets by the total # of interactions observed in the dataset so that datasets interaction values can be compared.

There are 5-6 million interactions listed in each file and each dataset is obviously under-sampled since there are ~500k elements (making a square matrix of ~250 trillion positions).

I would like to cluster these datasets so that I can make statements about which types of elements tend to cluster with which other types elements. Obviously, robusticity of clustering will be a factor—but this is partially ameliorated by the fact that I will make biological replicates of the data.

I have tried a few different "naive" clustering approaches just to see what I could do easily with the data. I fully realize that these are problematic ways of clustering, either because they are not robust or because they rely on the data being very undersampled, but here is what I've done:

  1. Clustering elements together as long as there is at least one interaction between each element in the cluster and at least one other element in the cluster. When I do this, all the elements end up in a single cluster. This was important to do because it tells me that there are no pairs of elements that are totally isolated from the rest of the group.

  2. Finding "superclusters"—that is, clusters where every member of the cluster interacts with every other member of the cluster (e.g. a triangle for a cluster of 3 and a box with an X in the middle for a cluster of 4, etc). This yields almost exclusively clusters with 2 and 3 elements after about 10% of the data has been analyzed (this is still running).

I would love to be able to do some sort of hierarchical clustering using my "interaction strength" values as the distance measure between each pair of elements (unobserved interactions have a strength of 0). Does anyone know of a way to do HC on this sort of large, sparse data—or know of a clustering method that might be more appropriate? I've used R up until now.

Best Answer

Does this happen to be protein interaction data? Regardless, there are many algorithms you can try. I am the author of mcl, a clustering algorithm fairly often used in bioinformatics. You could easily try it, after installing, by given it the data in exaxctly the format you describe; the command line would be "mcl yourdata --abc" (the abc option tells mcl to expect this particular format). You need not and cannot specify the number of clusters, as cluster structure arises from the process computed by mcl in an emergent fashion. Protein interaction data can be quite voluminous and noisy -- and other data sources as well of course. Especially the presence of nodes of very high-degree (say with hundreds to thousands of neighbours) may obscure cluster structure. In such a scenario, in the absence of weights, it can be really worthwhile to try to reduce the input graph by e.g. a k-NN transform taking into account the two-step graph incidence relation. mcl also provides means to do this, as well a tool that lists various graph traits under a series of ever more stringent transformations, so that it is possible to choose such a transformation in an informed manner. As for hierarchical clustering, it is possible to achieve clusterings at different levels of granularity by varying mcl's so-called inflation parameter. You could for example supply three values (e.g. 1.4, 2, and 5) to obtain clustering at different levels of granularity. It is also possible to arrange clusterings thus obtained into a strictly hierarchical data structure.

Related Question