Solved – Clustering genes in a time course experiment

clusteringmachine learningmicroarrayr

I have seen a few queries on clustering in time series and specifically on clustering, but I don't think they answer my question.

Background: I want to cluster genes in a time course experiment in yeast. There are four time points say: t1 t2 t3 and t4 and total number of genes G. I have the data in form a matrix M in which the columns represent the treatments (or time points) t1 t2 t3 and t4 and the rows represent the genes. Therefore, M is a Gx4 matrix.

Problem: I want to cluster the genes which behave the same across all time points t1 t2 t3 and t4 as well as within a particular time point ti , where i is in {1, 2, 3, 4} (In case we cannot do both the clusterings together, the clustering within a time point is more important than clustering across time points). In addition to this, I also want to draw a heatmap.

My Solution:
I use the R code below to obtain a heatmap as well as the clusters using hclust function in R (performs hierarchical clustering with euclidean distance)

    row.scaled.expr <- (expr.diff - rowMeans(expr.diff)) / rowSds(expr.diff)

    breaks.expr <- c(quantile(row.scaled.expr[row.scaled.expr < 0],
                               seq(0,1,length=10)[-9]), 0,
                               quantile(row.scaled.expr[row.scaled.expr > 0],
                               seq(0,1,length=10))[-1] )


    blue.red.expr <- maPalette(low = "blue", high = "red", mid = "white",
                     k=length(breaks.expr) - 1)

    pdf("images/clust.pdf",
         height=30,width=20,pointsize=20)
    ht1 <- heatmap.2(row.scaled.expr, col = blue.red.expr, Colv = FALSE, key = FALSE, 
      dendrogram = "row", scale = "none", trace = "none",
      cex=1.5, cexRow=1, cexCol=2,
      density.info = "none", breaks = breaks.expr, 
      labCol = colnames(row.scaled.expr),
      labRow="",
      lmat=rbind( c(0, 3), c(2,1), c(0,4) ), lhei=c(0.25, 4, 0.25 ),
      main=expression("Heat Map"),
      ylab="Genes in the Microarray",
      xlab="Treatments"
      )
    dev.off()

I recently discovered hopach package in Bioconductor which can be used to estimate the number of clusters. Previously, I was randomly assigning the number of bins for the heatmap and cutting the tree at an appropriate height to get a pre-specified number of clusters.

Possible Problems in my solution:

  1. I may be not clustering the genes within a particular treatment and clustering genes only across treatments or vice versa.
  2. There may be better ways of obtaining a heatmap for the pattern I want to see (similar genes within a treatment and across treatments).
  3. There may be better visualization methods which I am not aware of.

Note:

  1. csgillespie (moderator) has a more general document on his website in which he discusses all the aspects of time course analysis (including heatmaps and clustering). I would appreciate if you can point me to an articles which describe heatmaps and clustering in detail.

  2. I have tried the pvclust package, but it complains that M is singular and then it crashes.

Best Answer

It seems you just want to make a fair standard analysis, so I am not a best person to answer your question; yet I would suggest you to dive deeper into Bioconductor; it has a lot of useful stuff, nevertheless finding what you want is painful. For instance Mfuzz package looks promising.

Related Question