Solved – Confidence level for clusters

clustering

I have a sample, a set of outcomes of some random variable. I divide it into "clusters" using some deterministic approach. One of the clusters is considered to be "correct"; usually it is the one with the largest number of outcomes, but not always. I want to assign each cluster a "confidence level" based on the number of outcomes it contains. The formula should depend on the number of outcomes in every cluster and on the number in the current one.

Say, if the outcome numbers, for each cluster sequentially, are: 220, 31, 28, 21, then the first cluster should have very high confidence level, something close to 100%, but not exactly 100%. But if the distribution is 113, 110, 106, 103, then all the confidences should be almost equal – and not even close to 100% – since there are several similar clusters, we cannot mark each of them as "probably correct".

I cannot simply use the proportion relative to the maximum, because then the largest cluster would always get a confidence level of exactly 100%, which it shouldn't.

Is there some approach or formula that gives such estimates? Thank you.

Best Answer

You could use the conditional probability that each outcome is the correct one, under the prior assumption that the correct outcome occurs with some probability $p$ and each incorrect outcome occurs with probability $q=(1-p)/(n-1)$, where $n$ is the number of different outcomes, and that each outcome is equally likely a priori to be the correct one. Then the probability of observing counts $a_i$ with sum $\sum_ia_i=S$ if answer $k$ is correct is proportional to $p^{a_k}q^{S-a_k}$, so you could assign confidence levels

$$c_k=\frac{p^{a_k}q^{S-a_k}}{\sum_i p^{a_i}q^{S-a_i}}=\frac{p^{a_k}q^{-a_k}}{\sum_i p^{a_i}q^{-a_i}}\;.$$

You can choose the parameter $p$ according to your needs. For $p=q=1/n$, you'll get $c_k=1/n$ (which makes sense, since if you assume that people are just guessing, no amount of clustering will raise your confidence in one of the outcomes). For $p$ near $1$, you'll get sharply peaked confidence levels even for moderate differences in the outcome counts. In the limit $p\to1$, $q\to0$, the confidence level for the outcome with the highest count will go to $1$ and the others will go to $0$: you can divide the numerator and denominator by the lowest power of $q$ that appears (the one belonging to the highest count), after which all the other terms in the sum go to zero.
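To make this dependence on $p$ explicit, the confidence levels can be rewritten in terms of the single ratio $r=p/q$. The symbols $r$, $\beta$ and $a_{\max}$ are shorthand introduced here for illustration and are not part of the formula above:

$$c_k=\frac{r^{a_k}}{\sum_i r^{a_i}}=\frac{e^{\beta a_k}}{\sum_i e^{\beta a_i}}=\frac{r^{-(a_{\max}-a_k)}}{\sum_i r^{-(a_{\max}-a_i)}}\;,\qquad r=\frac pq\;,\quad\beta=\ln r\;,\quad a_{\max}=\max_i a_i\;.$$

Read this way, the confidence levels are a softmax of the counts with inverse temperature $\beta$. For $r=1$ every term equals $1$ and $c_k=1/n$. As $r\to\infty$, every term with $a_i<a_{\max}$ vanishes in the last form, so the confidence concentrates on the largest count (and is split equally if several counts tie for the maximum).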

If you want the confidence level in your first example to be noticeably different from $100\%$, you'd have to choose $p$ quite close to $1/n=25\%$. Here are some values for your examples:

$$
\begin{array}{|c|c|c|c|c|c|}
p&q&220&31&28&21\\ \hline
0.253&0.249&0.879&0.043&0.041&0.037\\ \hline
0.256&0.248&0.994&0.002&0.002&0.002\\ \hline
0.259&0.247&1.000&0.000&0.000&0.000
\end{array}
$$

$$
\begin{array}{|c|c|c|c|c|c|}
p&q&113&110&106&103\\ \hline
0.253&0.249&0.270&0.258&0.242&0.230\\ \hline
0.256&0.248&0.291&0.264&0.233&0.212\\ \hline
0.259&0.247&0.312&0.270&0.224&0.194\\ \hline
0.265&0.245&0.354&0.280&0.204&0.162\\ \hline
0.280&0.240&0.458&0.288&0.156&0.098\\ \hline
0.310&0.230&0.632&0.258&0.078&0.032
\end{array}
$$
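If you want to try other values of $p$, the formula for $c_k$ is easy to evaluate numerically. Below is a minimal Python sketch, not a definitive implementation. The function name `cluster_confidences` and its argument names are made up for this illustration. It works with logarithms and subtracts the largest exponent before exponentiating, which is an implementation choice to avoid overflow for large counts and does not change the result.

```python
import math


def cluster_confidences(counts, p):
    """Return the confidence level c_k for each cluster.

    counts: list of outcome counts a_i, one per cluster.
    p:      assumed prior probability of the correct outcome (0 < p < 1).
    Both names are hypothetical, chosen for this sketch.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two clusters")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")

    # Each incorrect outcome gets an equal share of the remaining probability.
    q = (1.0 - p) / (n - 1)

    # c_k is proportional to (p/q)**a_k, so compute a_k * log(p/q)
    # and shift by the maximum to keep the exponentials bounded by 1.
    log_ratio = math.log(p / q)
    exponents = [a * log_ratio for a in counts]
    top = max(exponents)
    weights = [math.exp(x - top) for x in exponents]

    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    for counts in ([220, 31, 28, 21], [113, 110, 106, 103]):
        for p in (0.253, 0.256, 0.259):
            row = cluster_confidences(counts, p)
            print(counts, p, [round(c, 3) for c in row])
```

Up to rounding, the printed rows should agree with the corresponding rows of the tables above. Choosing `p` equal to `1/n` returns `1/n` for every cluster.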
