Solved – Clustering into ordered clusters

clustering, k-means, ordinal-data, r

In a research study I have a list of countries and data about them.

GDP
Population
Oil exports
Oil imports
Percentage of electricity produced with renewable energies
Urbanization
Percentage of GDP put into research in renewable energies

Now I would like to cluster these countries into three groups. In the end, the groups should correspond to:

Countries with high ecological standards
Countries with medium ecological standards
Countries with low ecological standards

These 3 categories are ordered. I would also like to run the model with 5 categories.

Which clustering algorithm would be most appropriate? Is k-means a good choice in this case? What pitfalls arise when I use k-means? If you have some R code that solves a similar problem, I would be grateful for that too.

Best Answer

This is, I think, not a problem for cluster analysis at all. Cluster analysis is unsupervised learning and you want some form of supervision.

What you seem to want is factor analysis, not cluster analysis, but maybe not FA either. If you already know what "ecological standards" means, you could derive a variable yourself. If not, then factor analysis of your existing variables might give you a factor that you think of as ecological standards.

That factor might break up into three groups, but it might not. I am not sure why you want this to break up into groups (and exactly three). I think it would be better treated, for almost all purposes, as a continuous variable.
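To make the factor-analysis idea concrete, here is a minimal sketch in R. The data are simulated and the column names (`renewable_share`, `research_share`, and so on) are hypothetical stand-ins for the variables in the question. It fits a single factor with `factanal()`, keeps the factor score as a continuous variable, and only then cuts it into tertiles. The tertile cut is an arbitrary choice for illustration, and whether the factor can be read as "ecological standards" has to be judged from the loadings.

```r
set.seed(123)
n <- 60
eco <- rnorm(n)  # hypothetical latent trait, used only to simulate data

# Simulated, made-up country data (assumption: not from the question)
countries <- data.frame(
  renewable_share = 0.8 * eco + rnorm(n, sd = 0.6),
  research_share  = 0.7 * eco + rnorm(n, sd = 0.7),
  oil_exports     = -0.6 * eco + rnorm(n, sd = 0.8),
  urbanization    = 0.5 * eco + rnorm(n, sd = 0.9),
  log_gdp         = 0.4 * eco + rnorm(n, sd = 0.9)
)

# One-factor model; regression scores give one continuous value per country
fa <- factanal(scale(countries), factors = 1, scores = "regression")
score <- fa$scores[, 1]

# The sign of a factor is arbitrary: orient it so that a higher score
# goes with a higher renewable share
if (fa$loadings["renewable_share", 1] < 0) score <- -score

# Optional: cut the continuous score into three ordered groups (tertiles)
groups <- cut(score,
              breaks = quantile(score, c(0, 1/3, 2/3, 1)),
              include.lowest = TRUE,
              labels = c("low", "medium", "high"),
              ordered_result = TRUE)

print(fa$loadings)
table(groups)
```

For five categories, the same cut could use `quantile(score, seq(0, 1, by = 0.2))` with five labels. Note that quantile cuts always produce groups of roughly equal size, whether or not the data actually form groups.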

But if you already know which country belongs in which group, then you have a classification task, which calls for other methods.

Related Solutions

Solved – Compute BIC clustering criterion (to validate clusters after K-means)

To calculate the BIC for the kmeans results, I have tested the following methods:

The following formula is from: [ref2]

The R code for the above formula is:

  k3 <- kmeans(mt,3)
  intra.mean <- mean(k3$within)
  k10 <- kmeans(mt,10)
  centers <- k10$centers
  BIC <- function(mt,cls,intra.mean,centers){
    x.centers <- apply(centers,2,function(y){
      as.numeric(y)[cls]
    })
    sum1 <- sum(((mt-x.centers)/intra.mean)**2)
    sum1 + NCOL(mt)*length(unique(cls))*log(NROW(mt))
  }

The problem is that when I use the above R code, the calculated BIC increases monotonically. What is the reason?

[ref2] Ramsey, S. A., et al. (2008). "Uncovering a macrophage transcriptional program by integrating evidence from motif scanning and expression dynamics." PLoS Comput Biol 4(3): e1000021.

I have used the new formula from https://stackoverflow.com/questions/15839774/how-to-calculate-bic-for-k-means-clustering-in-r

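BIC2 <- function(fit) {
  m <- ncol(fit$centers)     # number of variables
  n <- length(fit$cluster)   # number of observations
  k <- nrow(fit$centers)     # number of clusters
  D <- fit$tot.withinss      # total within-cluster sum of squares
  data.frame(AIC = D + 2 * m * k,
             BIC = D + log(n) * m * k)
}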

This method gave the lowest BIC value at 155 clusters.
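For reference, a scan over the number of clusters with this criterion might look like the sketch below. The helper name `bic_scan` is hypothetical, and the built-in `mtcars` data are used only as a stand-in for `mt`. Because `tot.withinss` depends on the scale of the variables, the data are standardized first. That is a choice made for this example, not something the original post states.

```r
# Hypothetical helper: fit k-means for each k and tabulate the same
# AIC/BIC-style penalties as in the function above
bic_scan <- function(mt, ks, nstart = 10) {
  do.call(rbind, lapply(ks, function(k) {
    fit <- kmeans(mt, centers = k, nstart = nstart)
    m <- ncol(fit$centers)     # number of variables
    n <- length(fit$cluster)   # number of observations
    D <- fit$tot.withinss      # total within-cluster sum of squares
    data.frame(k = k, AIC = D + 2 * m * k, BIC = D + log(n) * m * k)
  }))
}

set.seed(42)
mt_demo <- scale(mtcars[, c("mpg", "hp", "wt", "qsec")])  # stand-in data
scan <- bic_scan(mt_demo, 2:8)
print(scan)
scan$k[which.min(scan$BIC)]  # k with the lowest value of this criterion
```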

Using the method provided by @ttnphns, the corresponding R code is listed below. However, what is the difference between Vc and V? And how do I compute the element-wise product of two vectors of different lengths?

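BIC3 <- function(fit, mt) {
  Nc <- as.matrix(as.numeric(table(fit$cluster)))   # cluster sizes
  K <- length(Nc)   # number of clusters (assumed meaning of K)
  P <- ncol(mt)     # number of variables (assumed meaning of P)
  # within-cluster variance of each variable (clusters in rows)
  Vc <- apply(mt, 2, function(x) {
    tapply(x, fit$cluster, var)
  })
  # overall variance of each variable, repeated once per cluster
  V <- matrix(rep(apply(mt, 2, var), length(Nc)),
              byrow = TRUE, nrow = length(Nc))
  LL <- -Nc * colSums(log(Vc + V) / 2)  ## how to calculate this? element-wise multiplication for two vectors with different length?
  BIC <- -2 * rowSums(LL) + 2 * K * P * log(NROW(mt))
  return(BIC)
}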
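One possible reading of that formula is sketched below. In the code above, `Vc` has one row per cluster and one column per variable, so summing over variables needs `rowSums()`, not `colSums()`. That yields one value per cluster, the same length as `Nc`, and the length mismatch disappears. Under this reading, `V` is the overall variance of each variable and `Vc` is its variance within each cluster. The function name `bic_kmeans`, the use of population (divide-by-n) variances, and the meanings of `K` (clusters) and `P` (variables) are assumptions here and should be checked against the original formula.

```r
# Sketch of a BIC for a k-means fit, assuming
#   LL_c = -N_c * sum_p log(V_cp + V_p) / 2
#   BIC  = -2 * sum_c LL_c + 2 * K * P * log(N)
bic_kmeans <- function(fit, mt) {
  mt <- as.matrix(mt)
  N  <- nrow(mt)
  P  <- ncol(mt)              # number of variables (assumed meaning of P)
  K  <- nrow(fit$centers)     # number of clusters (assumed meaning of K)
  Nc <- as.numeric(table(fit$cluster))   # cluster sizes, length K

  # K x P matrix of within-cluster variances (population form, so a
  # singleton cluster gets 0 instead of NA)
  sums  <- rowsum(mt, fit$cluster)
  sumsq <- rowsum(mt^2, fit$cluster)
  Vc <- pmax(sumsq / Nc - (sums / Nc)^2, 0)

  # Overall variance of each variable, length P
  V <- colMeans(mt^2) - colMeans(mt)^2

  # Add V to every row of Vc, then sum over variables: one value per cluster
  LL <- -Nc * rowSums(log(sweep(Vc, 2, V, "+")) / 2)
  -2 * sum(LL) + 2 * K * P * log(N)
}

set.seed(1)
mt_demo <- scale(mtcars[, c("mpg", "hp", "wt")])   # stand-in data
fit <- kmeans(mt_demo, centers = 3, nstart = 10)
bic_kmeans(fit, mt_demo)
```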

Solved – Clustering not producing even clusters

Most clustering algorithms prefer minimizing spread over balancing cluster element counts. That is, they try to find clusters of small extent that cover everything, not clusters of even size.

I'm pretty sure there must be more algorithms, but the only one I have recently come across that tries to keep cluster sizes the same is this tutorial:

http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans

In your case, I guess hierarchical clustering would be better than k-means. But in hierarchical clustering, ensuring same-sized clusters seems quite hard. At some point, you will have to do some really bad cluster assignment if you want to fix cluster sizes.

This is most obvious if you have a data set with extremely well separated clusters, but different size. Say you have 100 instances that are $N(0;1)$ distributed, and 1000 instances that are $N(10;1)$ distributed. If you enforce the clusters to have the same size, the result will be really, really bad by any measure.
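The example above is easy to simulate. The sketch below draws the two groups, runs plain k-means, and compares it with a forced equal-size split. The equal-size split is a simple cut at the median rank, chosen only for illustration. It is not the same-size k-means algorithm from the ELKI tutorial.

```r
set.seed(1)
x     <- c(rnorm(100, mean = 0, sd = 1), rnorm(1000, mean = 10, sd = 1))
truth <- rep(1:2, times = c(100, 1000))

# Plain k-means: free to pick clusters of different size
km <- kmeans(x, centers = 2, nstart = 10)
table(truth, kmeans = km$cluster)

# Forced equal sizes (illustrative median split, 550 points each):
# the smaller half has to absorb points from the large group
equal <- ifelse(rank(x, ties.method = "first") <= length(x) / 2, 1, 2)
table(truth, equal)
```

With the equal-size split, several hundred points drawn from the second distribution end up in the same cluster as the first, even though the two groups are far apart.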
