Solved – Select best set of binary variables for clustering known sample labels

Tags: binary-data, categorical-data, clustering, feature-selection, r

I have a set of samples for which I know the "true groups". For these samples I have about 200 binary variables, and I would like to know a method to select the subset of variables that gives me a clustering as close as possible to my known groups.

# sample labels (the true groups, encoded as colours)
labelColors2 <- c("black", "black", "black", "black", "black", "black",
                  "blue", "blue", "blue", "blue", "green", "green",
                  "red", "red", "red", "red", "red", "red",
                  "red", "red", "red", "red", "red", "red")

# data matrix (24 samples x ~200 binary variables)
library(RCurl)
x <- getURL("https://dl.dropboxusercontent.com/u/10712588/binMatrix")
tab3 <- read.table(text = x)

# colour each dendrogram leaf by its true group
colLab <- function(n) {
  if (is.leaf(n)) {
    a <- attributes(n)
    # clusMember: a vector designating leaf grouping
    labCol <- labelColors2[clusMember[which(names(clusMember) == a$label)]]
    attr(n, "nodePar") <- c(a$nodePar, list(lab.col = labCol, lab.cex = 0.8))
  }
  n
}

mclust <- hclust(dist(tab3, method = "binary"))
dhc <- as.dendrogram(mclust)
clusMember <- cutree(mclust, k = 24)  # k = 24: one leaf per sample
clusDendro <- dendrapply(dhc, colLab)
plot(clusDendro)

[Figure: example dendrogram with the true groups given by colours]

The colours should be grouped together. This is how I currently assess the goodness of a clustering, visually, but I would like to know a feature selection technique.

Thanks in advance.

Update: I found the klaR::stepclass function, which should do what I want (or some similar implementation), but I have not found a way to make it work yet.

library(cluster)  # for pam

fac <- as.factor(labelColors2)

# wrap the clustering so it looks like a classifier: cluster the
# selected columns with PAM (k = 4) and return a 0/1 "posterior"
# matrix with one column per group
mylda <- function(x, grouping) {
  clust <- pam(dist(x, method = "binary"), k = 4, cluster.only = TRUE)
  posterior <- matrix(0, 24, 4)
  colnames(posterior) <- c("black", "blue", "green", "red")
  for (i in 1:nrow(posterior)) posterior[i, clust[i]] <- 1
  l <- list(class = grouping, posterior = posterior)
  class(l) <- "foo"
  return(l)
}
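As a quick sanity check, the wrapper can be scored directly with klaR::ucpm (a sketch using the objects above; note that pam's cluster labels are mapped to the colour columns in arbitrary order, so the correctness rate is only meaningful up to that matching):

library(klaR)
# correctness rate when clustering on all ~200 variables
ucpm(mylda(tab3, fac)$posterior, fac)$CR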

With the function above I can reproduce an output similar to what klaR::ucpm needs, but I can't manage to run stepclass itself:

sc_obj <- stepclass(x=tab3, grouping=fac, method="mylda", direction="forward")

Error in parse(text = x) : <text>:2:0: unexpected end of input
1: fac ~
  ^ 

Update 2: I think I have made some progress. I established a "fitness function" and ran a random search (it is still running); I have already found a better clustering.

predict.foo <- function(object, ...) object  # predict() is a no-op for "foo" objects

for (i in 1:1000000) {
  # draw a random subset of 68 to 200 columns
  s <- sample(1:ncol(tab3), sample(68:200, 1))
  # correctness rate of the clustering on that subset
  cr <- ucpm(predict(mylda(tab3[, s], fac))$posterior, fac)$CR
  # append the rate and the column indices to a log file
  write.table(matrix(c(cr, s), nrow = 1), "randonSearch.txt",
              append = TRUE, row.names = FALSE, col.names = FALSE)
}

With this I'm monitoring the randonSearch.txt file with:

cut -d " " -f1 ../randonSearch.txt | grep 0.8
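To pull the best subset found so far back into R, something along these lines works (a sketch; readLines is used because each row has a different number of fields):

recs <- strsplit(readLines("randonSearch.txt"), " ")
cr   <- vapply(recs, function(v) as.numeric(v[1]), numeric(1))
bestCols <- as.integer(recs[[which.max(cr)]][-1])  # column indices of the best subset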

I have already found a "Correctness Rate" of 0.833.


I think there is still room for improvement; I'm thinking of a genetic algorithm…
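A minimal sketch of that idea with the GA package (an assumption on my part: GA is installed; the fitness simply reuses mylda and ucpm from above, with a bit string encoding which columns are included):

library(GA)

fitness <- function(bits) {
  s <- which(bits == 1)
  if (length(s) < 2) return(0)  # degenerate subsets get zero fitness
  tryCatch(ucpm(mylda(tab3[, s], fac)$posterior, fac)$CR,
           error = function(e) 0)
}

ga_res <- ga(type = "binary", fitness = fitness,
             nBits = ncol(tab3), popSize = 50, maxiter = 200)
bestCols <- which(ga_res@solution[1, ] == 1)  # selected column indices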

Best Answer

It is hard to answer your question without knowledge of how many samples you have and how many features you want, but here is a quick and dirty solution that may work.

Draw random pairs of samples from your set and compute a derived feature vector in {-1, 1}^200, with +1 in positions where the two samples are the same and -1 where the two samples are different. Assign a label +1 if the two samples are from the same cluster and -1 if they are from different clusters. Keep drawing pairs of samples until you have a sizable number. You will now have a labeled data set of training examples.
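In R, the pairwise training set could be built along these lines (a sketch; tab3 and fac are the objects from the question, and nPairs = 2000 is an arbitrary choice):

set.seed(1)
m <- as.matrix(tab3)
nPairs <- 2000
pairs <- replicate(nPairs, sample(nrow(m), 2))  # 2 x nPairs matrix of row indices
# derived features: +1 where the two samples agree on a variable, -1 where they differ
X <- t(apply(pairs, 2, function(p) ifelse(m[p[1], ] == m[p[2], ], 1, -1)))
# labels: +1 if the pair comes from the same group, -1 otherwise
y <- ifelse(fac[pairs[1, ]] == fac[pairs[2, ]], 1, -1)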

Now run a feature selection algorithm for classification (of which there are many) on this classification problem. You might start with a simple method like using lars to fit a regression model and using the indices of the non-zero coefficients to pick your features.
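With lars that might look like the following (a sketch; X and y come from the pairwise sketch above, and s = 0.5 is an arbitrary point on the regularisation path):

library(lars)
fit <- lars(X, y, type = "lasso")
beta <- coef(fit, s = 0.5, mode = "fraction")  # coefficients at that point on the path
selected <- which(beta != 0)                   # candidate feature indices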
