Solved – Confidence level for clusters

clustering

I have a sample: a set of outcomes of some random variable. I divide it into "clusters" using some deterministic approach. One of the clusters is considered "correct"; usually it is the one with the most outcomes, but not always. I want to assign each cluster a "confidence level" based on the number of outcomes it contains. The formula should depend on the number of outcomes in every cluster and in the current one.

For example, if the cluster sizes are 220, 31, 28, 21, then the first cluster should have a very high confidence level, close to 100% but not exactly 100%. But if the sizes are 113, 110, 106, 103, then all the confidence levels should be almost equal, and nowhere near 100%: since there are several similar clusters, we cannot mark any one of them as "probably correct".

I cannot simply use the proportion relative to the maximum, because then the largest cluster would always get a confidence level of exactly 100%, which it shouldn't.

Is there an approach or formula that gives such estimates? Thank you.

Best Answer

You could use the conditional probability that each cluster is the correct one, under the prior assumption that an outcome falls into the correct cluster with some probability $p$ and into each incorrect cluster with probability $q=(1-p)/(n-1)$, where $n$ is the number of clusters. Then the probability of observing counts $a_i$ with sum $\sum_ia_i=S$, given that cluster $k$ is correct, is proportional to $p^{a_k}q^{S-a_k}$, so you could assign confidence levels

$$c_k=\frac{p^{a_k}q^{S-a_k}}{\sum_i p^{a_i}q^{S-a_i}}=\frac{p^{a_k}q^{-a_k}}{\sum_i p^{a_i}q^{-a_i}}\;.$$

You can choose the parameter $p$ according to your needs. For $p=q=1/n$, you'll get $c_k=1/n$ (which makes sense: if you assume that outcomes land in clusters purely by chance, no amount of clustering will raise your confidence in any one of them). For $p$ near $1$, you'll get sharply peaked confidence levels even for moderate differences in the counts. In the limit $p\to1$, $q\to0$, the confidence level for the cluster with the highest count goes to $1$ and the others go to $0$: dividing numerator and denominator by the lowest power of $q$ that occurs, $q^{S-\max_i a_i}$, leaves every other term in the sum with a positive power of $q$, which goes to zero.
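
One way to see both limits at once is to write the ratio $p/q$ as an exponential. This is only a restatement of the formula above; the shorthand $\lambda$ is introduced here for illustration and does not appear in the original answer.

$$c_k=\frac{e^{\lambda a_k}}{\sum_i e^{\lambda a_i}}\;,\qquad\lambda=\ln\frac pq=\ln\frac{(n-1)\,p}{1-p}\;.$$

For $p=1/n$ this gives $\lambda=0$ and hence $c_k=1/n$; as $p\to1$, $\lambda\to\infty$ and the term with the largest count dominates. Only the differences between counts matter. For instance, with $n=4$ and $p=0.253$ (so $\lambda\approx0.0159$), the counts 220, 31, 28, 21 give

$$c_1=\frac1{1+e^{-189\lambda}+e^{-192\lambda}+e^{-199\lambda}}\approx\frac1{1+0.049+0.047+0.042}\approx0.879\;.$$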

If you want the confidence level in your first example to be noticeably different from $100\%$, you'd have to choose $p$ quite close to $1/n=25\%$. Here are some values for your examples:

$$ \begin{array}{|c|c|c|c|c|c|} p&q&220&31&28&21\\ \hline 0.253&0.249& 0.879&0.043&0.041&0.037\\ \hline 0.256&0.248&0.994&0.002&0.002&0.002\\ \hline 0.259&0.247&1.000&0.000&0.000&0.000 \end{array} $$

$$ \begin{array}{|c|c|c|c|c|c|} p&q&113&110&106&103\\ \hline 0.253&0.249& 0.270&0.258&0.242&0.230\\ \hline 0.256&0.248&0.291&0.264&0.233&0.212\\ \hline 0.259&0.247&0.312&0.270&0.224&0.194\\ \hline 0.265&0.245& 0.354&0.280&0.204&0.162\\ \hline 0.280&0.240&0.458&0.288&0.156&0.098\\ \hline 0.310&0.230&0.632&0.258&0.078&0.032 \end{array} $$
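
The formula for $c_k$ above is straightforward to evaluate numerically. The following is a minimal Python sketch, not part of the original answer; the function name `confidence_levels` is hypothetical. It works with logarithms of the weights $(p/q)^{a_k}$ and subtracts the largest one before exponentiating, which is an implementation choice made here to avoid overflow for large counts.

```python
import math


def confidence_levels(counts, p):
    """Return c_k = (p/q)^a_k / sum_i (p/q)^a_i, with q = (1 - p) / (n - 1).

    counts: number of outcomes in each cluster (a_1, ..., a_n)
    p:      assumed probability that an outcome lands in the correct cluster
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two clusters")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")

    q = (1.0 - p) / (n - 1)
    log_ratio = math.log(p / q)

    # Log of each unnormalised weight (p/q)^a_k.
    log_weights = [a * log_ratio for a in counts]

    # Subtract the maximum so that exp() never overflows.
    top = max(log_weights)
    weights = [math.exp(w - top) for w in log_weights]

    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    for counts in ([220, 31, 28, 21], [113, 110, 106, 103]):
        for p in (0.253, 0.256, 0.259):
            levels = confidence_levels(counts, p)
            print(counts, p, [round(c, 3) for c in levels])
```

Run as a script, this prints the confidence levels for both examples at the first three values of $p$ used in the tables; the results should agree with the corresponding table rows up to rounding.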

Related Solutions

Solved – How to fit mixture model for clustering

Here is a script for fitting a mixture model using the mclust package.

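# Simulate five clusters along X; Y is noise around a common mean
X <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3), rnorm(200, 65, 3), rnorm(200, 80, 5))
Y <- rnorm(1000, 30, 2)
plot(X, Y, ylim = c(10, 60), pch = 19, col = "gray40")

library(mclust)
# Mclust selects the number of components and the covariance model by BIC
xyMclust <- Mclust(data.frame(X, Y))
plot(xyMclust)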

In a situation where there are fewer than 5 clusters:

X1 <- c(rnorm(200, 10, 3), rnorm(200, 25,3), rnorm(200,35,3),  rnorm(200,80,5))
Y1 <- c(rnorm(800, 30, 2))
xyMclust <- Mclust(data.frame (X1,Y1))
plot(xyMclust)

xyMclust3 <- Mclust(data.frame(X1, Y1), G = 3)
plot(xyMclust3)

In this case we fit 3 clusters. What if we fit 5 clusters?

xyMclust5 <- Mclust(data.frame(X1, Y1), G = 5)
plot(xyMclust5)

Setting G = 5 forces Mclust to fit exactly 5 clusters.

Also let's introduce some random noise:

X2 <- c(rnorm(200, 10, 3), rnorm(200, 25,3), rnorm(200,35,3),  rnorm(200,80,5), runif(50,1,100 ))
Y2 <- c(rnorm(850, 30, 2))
xyMclust1 <- Mclust(data.frame (X2,Y2))
plot(xyMclust1)

mclust allows model-based clustering with noise, namely outlying observations that do not belong to any cluster. It also lets you specify a prior distribution to regularize the fit to the data: the function priorControl specifies the prior and its parameters, and when called with its defaults it invokes another function, defaultPrior, which can serve as a template for alternative priors. To include noise in the model, an initial guess of the noise observations must be supplied via the noise component of the initialization argument of Mclust or mclustBIC.

Another alternative is the mixtools package, which lets you specify the mean and sigma of each component.

X2 <- c(rnorm(200, 10, 3), rnorm(200, 25,3), rnorm(200,35,3),
    rnorm(200,80,5), rpois(50,30))
Y2 <- c(rnorm(800, 30, 2), rpois(50,30))
df <- cbind (X2, Y2)
require(mixtools)
out <- mvnormalmixEM(df, lambda = NULL, mu = NULL, sigma = NULL,
   k = 5,arbmean = TRUE, arbvar = TRUE, epsilon = 1e-08,  maxit = 10000, verb = FALSE)
plot(out, density = TRUE, alpha = c(0.01, 0.05, 0.10, 0.12, 0.15),  marginal = TRUE)
