I have seen a few queries on clustering in time series and specifically on clustering, but I don't think they answer my question.
Background: I want to cluster genes in a time course experiment in yeast. There are four time points say: t1 t2 t3 and t4 and total number of genes G. I have the data in form a matrix M in which the columns represent the treatments (or time points) t1 t2 t3 and t4 and the rows represent the genes. Therefore, M is a Gx4 matrix.
Problem: I want to cluster the genes which behave the same across all time points t1 t2 t3 and t4 as well as within a particular time point ti , where i is in {1, 2, 3, 4} (In case we cannot do both the clusterings together, the clustering within a time point is more important than clustering across time points). In addition to this, I also want to draw a heatmap.
My Solution:
I use the R code below to obtain a heatmap as well as the clusters using hclust
function in R (performs hierarchical clustering with euclidean distance)
row.scaled.expr <- (expr.diff - rowMeans(expr.diff)) / rowSds(expr.diff)
breaks.expr <- c(quantile(row.scaled.expr[row.scaled.expr < 0],
seq(0,1,length=10)[-9]), 0,
quantile(row.scaled.expr[row.scaled.expr > 0],
seq(0,1,length=10))[-1] )
blue.red.expr <- maPalette(low = "blue", high = "red", mid = "white",
k=length(breaks.expr) - 1)
pdf("images/clust.pdf",
height=30,width=20,pointsize=20)
ht1 <- heatmap.2(row.scaled.expr, col = blue.red.expr, Colv = FALSE, key = FALSE,
dendrogram = "row", scale = "none", trace = "none",
cex=1.5, cexRow=1, cexCol=2,
density.info = "none", breaks = breaks.expr,
labCol = colnames(row.scaled.expr),
labRow="",
lmat=rbind( c(0, 3), c(2,1), c(0,4) ), lhei=c(0.25, 4, 0.25 ),
main=expression("Heat Map"),
ylab="Genes in the Microarray",
xlab="Treatments"
)
dev.off()
I recently discovered hopach
package in Bioconductor which can be used to estimate the number of clusters. Previously, I was randomly assigning the number of bins for the heatmap and cutting the tree at an appropriate height to get a pre-specified number of clusters.
Possible Problems in my solution:
- I may be not clustering the genes within a particular treatment and clustering genes only across treatments or vice versa.
- There may be better ways of obtaining a heatmap for the pattern I want to see (similar genes within a treatment and across treatments).
- There may be better visualization methods which I am not aware of.
Note:
-
csgillespie (moderator) has a more general document on his website in which he discusses all the aspects of time course analysis (including heatmaps and clustering). I would appreciate if you can point me to an articles which describe heatmaps and clustering in detail.
-
I have tried the
pvclust
package, but it complains that M is singular and then it crashes.
Best Answer
It seems you just want to make a fair standard analysis, so I am not a best person to answer your question; yet I would suggest you to dive deeper into Bioconductor; it has a lot of useful stuff, nevertheless finding what you want is painful. For instance Mfuzz package looks promising.