Solved – Alternative to Otsu for dividing data into two groups

clustering, machine learning

I need to be able to automatically divide a dataset into two clusters. There are heuristic reasons to expect the data to have two clusters, which would be visually clear if one were to plot the data, and in the cases I have tested this has panned out. I am familiar with Otsu's method for turning a grayscale image into a black-and-white image, and it seems like one possible approach. My knowledge of it comes from image processing, though, and I expect there are more standard statistical methods that existed long before it; I just don't know about them. What alternatives are there, particularly ones that provide a number quantifying how well separated the two clusters are and that can also be used to detect cases where the clusters fail to exist?
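For concreteness, the Otsu idea carries over directly from histograms to a raw numeric vector: choose the threshold that maximises the between-class variance, and use the ratio of between-class to total variance as a separation score (near 1 for well-separated groups, near 0 when no split exists). A minimal sketch, with the function name and normalisation chosen just for illustration:

```r
## Otsu-style split of a numeric vector: scan cut points of the sorted data,
## keep the one with the largest between-class variance, and report the
## between/total variance ratio as a rough "how divided" score.
otsu_split <- function(x) {
  x <- sort(x)
  n <- length(x)
  total <- mean((x - mean(x))^2)                 # population variance of all data
  best_t <- NA
  best_b <- -Inf
  for (i in 1:(n - 1)) {
    w0 <- i / n
    w1 <- 1 - w0
    b <- w0 * w1 * (mean(x[1:i]) - mean(x[(i + 1):n]))^2   # between-class variance
    if (b > best_b) {
      best_b <- b
      best_t <- (x[i] + x[i + 1]) / 2            # cut halfway between neighbours
    }
  }
  list(threshold = best_t, separation = best_b / total)    # separation in [0, 1]
}

## Well-separated groups score near 1; a single blob scores much lower
otsu_split(c(rnorm(50, mean = 0), rnorm(50, mean = 6)))
```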

Note
After looking into the Jenks algorithm proposed in the answer, I found that the classInt package in R offers a number of such algorithms. I quote its documentation below to expand on the answer, and a short example of trying a few of the styles follows the excerpt. I have no idea how well these perform in practice; I post them simply to show the variety of possibilities, and because being in R makes them easy to try out for yourself.

The fixed style permits a "classIntervals" object to be specified with given breaks, set in the
fixedBreaks argument; the length of fixedBreaks should be n+1; this style can be used to insert
rounded break values.

The sd style chooses breaks based on pretty of the centred and scaled variables, and may have a
number of classes different from n; the returned par= includes the centre and scale values.

The equal style divides the range of the variable into n parts.

The pretty style chooses a number of breaks not necessarily equal to n using pretty, but likely
to be legible; arguments to pretty may be passed through ….

The quantile style provides quantile breaks; arguments to quantile may be passed through ….

The kmeans style uses kmeans to generate the breaks; it may be anchored using set.seed; the
pars attribute returns the kmeans object generated; if kmeans fails, a jittered input vector containing
rtimes replications of var is tried — with few unique values in var, this can prove necessary;
arguments to kmeans may be passed through ….

The hclust style uses hclust to generate the breaks using hierarchical clustering; the pars attribute
returns the hclust object generated, and can be used to find other breaks using getHclustClassIntervals;
arguments to hclust may be passed through ….

The bclust style uses bclust to generate the breaks using bagged clustering; it may be anchored
using set.seed; the pars attribute returns the bclust object generated, and can be used to find other
breaks using getBclustClassIntervals; if bclust fails, a jittered input vector containing rtimes
replications of var is tried — with few unique values in var, this can prove necessary; arguments
to bclust may be passed through ….

The fisher style uses the algorithm proposed by W. D. Fisher (1958) and discussed by Slocum et
al. (2005) as the Fisher-Jenks algorithm; added here thanks to Hisaji Ono.

The jenks style has been ported from Jenks’ Basic code, and has been checked for consistency
with ArcView, ArcGIS, and MapInfo (with some remaining differences); added here thanks to
Hisaji Ono; note that the sense of interval closure is reversed from the other styles, and in this
implementation has to be right-closed – use cutlabels=TRUE downstream for clarity.
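
As a quick illustration of the styles above, here is a sketch that runs a few of them on the same toy vector through classIntervals(); only that function and its documented style names are taken from the package, the data and the loop are just for demonstration:

```r
library(classInt)

set.seed(1)                                        # anchors the kmeans style, as noted above
x <- c(rnorm(60, mean = 0), rnorm(40, mean = 5))   # toy data with two obvious groups

for (style in c("jenks", "fisher", "kmeans", "quantile")) {
  ci <- classIntervals(x, n = 2, style = style)
  cat(style, "breaks:", round(ci$brks, 3), "\n")   # for n = 2 the middle value is the split point
}
```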

Best Answer

Have a look at natural breaks optimization.

https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization

The term "clustering" is mostly used for multidimensional data.

Stay away from k-means. It is popular, but usually not appropriate for 1D data. 1D data can be handled much more cleverly, since it is obviously ordered. The usual k-means algorithms (Lloyd, MacQueen) do not use this ordering and will test various nonsensical (non-contiguous) combinations.
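
To illustrate what using the ordering buys you: for two groups, the split that minimises the within-group sum of squares (the criterion the Fisher-Jenks approach optimises) is always a single cut point of the sorted data, so it can be found exactly by scanning the n-1 candidate cuts instead of iterating from random starts. A sketch with names of my own choosing (it is the same scan as the Otsu sketch in the question, just phrased as minimising within-group scatter):

```r
## Exact two-group split of a 1D vector by minimising within-group sum of squares.
## Sorting first guarantees both groups are contiguous intervals on the line.
best_two_group_split <- function(x) {
  x <- sort(x)
  n <- length(x)
  wss <- function(v) sum((v - mean(v))^2)                  # within-group sum of squares
  cost <- sapply(1:(n - 1), function(i) wss(x[1:i]) + wss(x[(i + 1):n]))
  i <- which.min(cost)
  list(cut = (x[i] + x[i + 1]) / 2, within_ss = cost[i])   # cut halfway between neighbours
}
```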

But here is another quite easy method: compute a kernel density estimate, locate its local minima, and use those to split your data.
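
A sketch of that approach using base R's density(): cut at the deepest interior local minimum of the estimate, and report failure when the estimate has no such minimum, which also gives a crude answer to the "clusters fail to exist" part of the question.

```r
## Kernel density estimate with base R's density(); split at the deepest
## interior local minimum, or return NA if the estimate has no such minimum.
kde_split <- function(x) {
  d <- density(x)                             # default bandwidth; worth varying in practice
  y <- d$y
  mins <- which(diff(sign(diff(y))) > 0) + 1  # indices lower than both neighbours
  if (length(mins) == 0) return(NA)           # unimodal estimate: no split found
  d$x[mins[which.min(y[mins])]]               # x-position of the deepest minimum
}

kde_split(c(rnorm(50, mean = 0), rnorm(50, mean = 4)))   # should land somewhere near 2
```

The depth of that minimum relative to the neighbouring modes can also serve as a rough score of how divided the two groups are.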
