Solved – Clustering sequence on similarity using percentage identity matrix

clustering, hierarchical clustering, model-based-clustering, r

I have a set of 400 nucleotide sequences and want to cluster them on the basis of similarity. I expect a similarity of <=45% among the members of a cluster. There will also be a few sequences that show no similarity to any other member. Is there a clustering approach that lets us set a cut-off for the relation (similarity) between members, and that keeps the members with very low similarity in an "unclustered" set?

I have generated the percentage identity matrix (400 x 400) using Clustal Omega and am using this matrix for clustering with the affinity propagation approach, but I am not getting good results.

P.S. I have already tried "cd-hit" and "uclust", but they are not recommended when the expected sequence similarity is below 70%.

Link to my question on Biostars: https://www.biostars.org/p/147913/

Bade

Best Answer

Hierarchical clusterings are commonly cut at a threshold level of similarity, such as 45%.
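As a rough sketch of what cutting the tree at a threshold looks like, here is a small Python example using SciPy. Several things in it are my assumptions and not part of the question: the toy 6 x 6 identity matrix is made up, the cut-off is read as "at least 45% identity within a cluster", identity is converted to a distance as 100 minus identity, and complete linkage is chosen so that the cut applies to every pair in a cluster. Clusters of size one are treated as the "unclustered" set.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy percentage identity matrix (made-up values standing in for the
# 400 x 400 Clustal Omega output).
pid = np.array([
    [100, 80, 75, 20, 10, 15],
    [80, 100, 70, 25, 15, 18],
    [75, 70, 100, 22, 12, 30],
    [20, 25, 22, 100, 60, 28],
    [10, 15, 12, 60, 100, 25],
    [15, 18, 30, 28, 25, 100],
], dtype=float)

# Assumption: distance = 100 - identity, so a 45% identity cut-off
# becomes a distance cut-off of 55.
dist = 100.0 - pid
cutoff = 100.0 - 45.0

# linkage() expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(dist)

# Complete linkage: after cutting at `cutoff`, no pair inside a cluster
# is farther apart than `cutoff`.
tree = linkage(condensed, method="complete")
labels = fcluster(tree, t=cutoff, criterion="distance")

# Treat singleton clusters as "unclustered".
sizes = np.bincount(labels)
unclustered = [i for i, lab in enumerate(labels) if sizes[lab] == 1]

print("cluster labels:", labels.tolist())
print("unclustered indices:", unclustered)
```

With average or single linkage the same cut no longer bounds every pairwise distance inside a cluster, so the choice of linkage depends on how strictly the cut-off should hold.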

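Furthermore, you can use DBSCAN with epsilon set to the same 45% threshold. DBSCAN works on distances, so if you measure distance as 100% minus identity, a 45% identity cut-off corresponds to an epsilon of 55%. Points with too few neighbors within epsilon are labeled as noise, which gives you the "unclustered" set.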
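A minimal DBSCAN sketch on a precomputed matrix, using scikit-learn in Python. The toy identity matrix is invented for illustration, and the conversion to distance (100 minus identity, hence eps = 55 for a 45% identity cut-off) and min_samples=2 are my assumptions, not settings from the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy percentage identity matrix (made-up values, not real data).
pid = np.array([
    [100, 80, 75, 20, 10, 15],
    [80, 100, 70, 25, 15, 18],
    [75, 70, 100, 22, 12, 30],
    [20, 25, 22, 100, 60, 28],
    [10, 15, 12, 60, 100, 25],
    [15, 18, 30, 28, 25, 100],
], dtype=float)

# Assumption: distance = 100 - identity.
dist = 100.0 - pid

# metric="precomputed" tells DBSCAN that `dist` is already a distance matrix.
# min_samples=2 means a sequence needs at least one other sequence within
# eps (it counts itself) to start or extend a cluster.
model = DBSCAN(eps=55.0, min_samples=2, metric="precomputed")
labels = model.fit_predict(dist)

# DBSCAN marks noise points with the label -1.
unclustered = np.flatnonzero(labels == -1).tolist()

print("cluster labels:", labels.tolist())
print("unclustered indices:", unclustered)
```

Note that DBSCAN links sequences through chains of neighbors, so two members of the same cluster can be less than 45% identical to each other as long as they are connected through intermediates.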

There are plenty of other choices if you keep looking.
