Solved – Select best set of binary variables for clustering known sample labels

Tags: binary-data, categorical-data, clustering, feature-selection, r

I have a set of samples for which I know the "true groups". For these samples I have about 200 binary variables, and I would like to know a method to select the subset of variables that gives me a clustering as close as possible to my known groups.

# sample labels (the true groups, encoded as colours)
labelColors2 <- c("black", "black", "black", "black", "black", "black",
                  "blue", "blue", "blue", "blue", "green", "green",
                  "red", "red", "red", "red", "red", "red",
                  "red", "red", "red", "red", "red", "red")

# data matrix (24 samples x ~200 binary variables)
library(RCurl)
x <- getURL("https://dl.dropboxusercontent.com/u/10712588/binMatrix")
tab3 <- read.table(text = x)

# colour each dendrogram leaf by its true group
colLab <- function(n) {
  if (is.leaf(n)) {
    a <- attributes(n)
    # clusMember: a vector designating leaf grouping
    labCol <- labelColors2[clusMember[which(names(clusMember) == a$label)]]
    attr(n, "nodePar") <- c(a$nodePar, list(lab.col = labCol, lab.cex = 0.8))
  }
  n
}

mclust <- hclust(dist(tab3, method = "binary"))
dhc <- as.dendrogram(mclust)
clusMember <- cutree(mclust, k = 24)  # k = 24: one leaf per sample
clusDendro <- dendrapply(dhc, colLab)
plot(clusDendro)

[Figure: example dendrogram with the true groups given by colours]

The colours should be grouped together. This is how I currently assess the goodness of a clustering, visually, but I would like to know a feature selection technique.

Thanks in advance.

Update: I found the klaR::stepclass function, which should do what I want (or some similar implementation), but I have not found a way to make it work yet.

library(cluster)  # for pam

fac <- as.factor(labelColors2)

# wrap the clustering so it looks like a classifier: cluster the
# selected columns with PAM (k = 4) and return a 0/1 "posterior"
# matrix with one column per group
mylda <- function(x, grouping) {
  clust <- pam(dist(x, method = "binary"), k = 4, cluster.only = TRUE)
  posterior <- matrix(0, 24, 4)
  colnames(posterior) <- c("black", "blue", "green", "red")
  for (i in 1:nrow(posterior)) posterior[i, clust[i]] <- 1
  l <- list(class = grouping, posterior = posterior)
  class(l) <- "foo"
  return(l)
}
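As a quick sanity check, the wrapper can be scored directly with klaR::ucpm (a sketch using the objects above; note that pam's cluster labels are mapped to the colour columns in arbitrary order, so the correctness rate is only meaningful up to that matching):

library(klaR)
# correctness rate when clustering on all ~200 variables
ucpm(mylda(tab3, fac)$posterior, fac)$CR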

With the function above I can reproduce an output similar to what klaR::ucpm needs, but I can't manage to run stepclass itself:

sc_obj <- stepclass(x=tab3, grouping=fac, method="mylda", direction="forward")

Error in parse(text = x) : <text>:2:0: unexpected end of input
1: fac ~
  ^ 

Update 2: I think I have made some progress. I established a "fitness function" and ran a random search (it is still running); I have already found a better clustering.

predict.foo <- function(object, ...) object  # predict() is a no-op for "foo" objects

for (i in 1:1000000) {
  # draw a random subset of 68 to 200 columns
  s <- sample(1:ncol(tab3), sample(68:200, 1))
  # correctness rate of the clustering on that subset
  cr <- ucpm(predict(mylda(tab3[, s], fac))$posterior, fac)$CR
  # append the rate and the column indices to a log file
  write.table(matrix(c(cr, s), nrow = 1), "randonSearch.txt",
              append = TRUE, row.names = FALSE, col.names = FALSE)
}

With this I'm monitoring the randonSearch.txt file with:

cut -d " " -f1 ../randonSearch.txt | grep 0.8
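To pull the best subset found so far back into R, something along these lines works (a sketch; readLines is used because each row has a different number of fields):

recs <- strsplit(readLines("randonSearch.txt"), " ")
cr   <- vapply(recs, function(v) as.numeric(v[1]), numeric(1))
bestCols <- as.integer(recs[[which.max(cr)]][-1])  # column indices of the best subset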

I have already found a "Correctness Rate" of 0.833.


I think there is still room for improvement; I'm thinking of a genetic algorithm…
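A minimal sketch of that idea with the GA package (an assumption on my part: GA is installed; the fitness simply reuses mylda and ucpm from above, with a bit string encoding which columns are included):

library(GA)

fitness <- function(bits) {
  s <- which(bits == 1)
  if (length(s) < 2) return(0)  # degenerate subsets get zero fitness
  tryCatch(ucpm(mylda(tab3[, s], fac)$posterior, fac)$CR,
           error = function(e) 0)
}

ga_res <- ga(type = "binary", fitness = fitness,
             nBits = ncol(tab3), popSize = 50, maxiter = 200)
bestCols <- which(ga_res@solution[1, ] == 1)  # selected column indices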

Best Answer

It is hard to answer your question without knowledge of how many samples you have and how many features you want, but here is a quick and dirty solution that may work.

Draw random pairs of samples from your set and compute a derived feature vector in {-1, 1}^200, with +1 in positions where the two samples are the same and -1 where the two samples are different. Assign a label +1 if the two samples are from the same cluster and -1 if they are from different clusters. Keep drawing pairs of samples until you have a sizable number. You will now have a labeled data set of training examples.
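In R, the pairwise training set could be built along these lines (a sketch; tab3 and fac are the objects from the question, and nPairs = 2000 is an arbitrary choice):

set.seed(1)
m <- as.matrix(tab3)
nPairs <- 2000
pairs <- replicate(nPairs, sample(nrow(m), 2))  # 2 x nPairs matrix of row indices
# derived features: +1 where the two samples agree on a variable, -1 where they differ
X <- t(apply(pairs, 2, function(p) ifelse(m[p[1], ] == m[p[2], ], 1, -1)))
# labels: +1 if the pair comes from the same group, -1 otherwise
y <- ifelse(fac[pairs[1, ]] == fac[pairs[2, ]], 1, -1)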

Now run a feature selection algorithm for classification (of which there are many) on this classification problem. You might start with a simple method like using lars to fit a regression model and using the indices of the non-zero coefficients to pick your features.
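With lars that might look like the following (a sketch; X and y come from the pairwise sketch above, and s = 0.5 is an arbitrary point on the regularisation path):

library(lars)
fit <- lars(X, y, type = "lasso")
beta <- coef(fit, s = 0.5, mode = "fraction")  # coefficients at that point on the path
selected <- which(beta != 0)                   # candidate feature indices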
