Solved – How to define a posterior probability of y given x when the model is not probabilistic

classification, clustering, k-means, posterior, probability

Suppose we have a very simple online k-means in which each new data point is assigned to its nearest center, and that center's mean is then updated incrementally. Each center (cluster) is labelled with the most common label among the data points assigned to it. In this setting, is it possible to compute a sort of "posterior probability"? That is, can the posterior probability of a class label $y$ given a data point $x$, $P(y \mid x)$, simply be taken as $1/\text{distance}(x, m_y)$, where $m_y$ is the center labelled $y$ that is nearest to $x$?

Best Answer

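Since you can view k-means as a sort of impoverished mixture of Normals (specifically, the limit in which all clusters share the same spherical variance and that variance goes to 0), I'd be tempted to use the density function of the Normal distribution if you need a probabilistic metric. If you're willing to assume equal variances in all the clusters, the normalising constant of the density is the same for every cluster and can be ignored: score each cluster by $\exp(-\lVert x - m_k \rVert^2 / (2\sigma^2))$, where $m_k$ is the center of cluster $k$ and $\sigma^2$ is the shared variance, and normalise those scores across all clusters so that they sum to 1. You could also weight each score by a prior probability of the cluster, taken as the fraction of points assigned to it.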

It's not pretty and it's not really theoretically justified, but it may suffice.
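
For what it's worth, here is a minimal sketch of that heuristic in Python with NumPy. The function name `class_posteriors`, its parameters, and the default `sigma=1.0` are all made up for illustration. It also assumes that, when several centers carry the same label, their normalised scores are simply added to get the score for that label; the answer above does not spell that step out.

```python
import numpy as np

def class_posteriors(x, centers, center_labels, counts=None, sigma=1.0):
    """Heuristic P(y | x) from k-means centers.

    Assumes (hypothetically) that every cluster is a spherical Gaussian
    with the same standard deviation `sigma`; k-means itself defines no
    such probabilities.
    """
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    center_labels = np.asarray(center_labels)

    # Squared Euclidean distance from x to every center.
    sq_dist = np.sum((centers - x) ** 2, axis=1)

    # Log of the unnormalised Gaussian density. The shared normalising
    # constant is dropped because it is identical for all clusters.
    log_w = -sq_dist / (2.0 * sigma ** 2)

    # Optional prior: fraction of points assigned to each cluster.
    if counts is not None:
        counts = np.asarray(counts, dtype=float)
        log_w = log_w + np.log(counts / counts.sum())

    # Normalise across clusters; subtracting the max avoids underflow.
    w = np.exp(log_w - log_w.max())
    cluster_post = w / w.sum()

    # Assumption: add up the scores of all clusters sharing a label.
    return {
        label.item(): float(cluster_post[center_labels == label].sum())
        for label in np.unique(center_labels)
    }


if __name__ == "__main__":
    centers = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
    labels = ["a", "b", "a"]
    print(class_posteriors([3.0, 0.5], centers, labels, counts=[10, 5, 5]))
```

The choice of `sigma` controls how sharp the resulting probabilities are: a small value pushes them towards the hard nearest-center assignment, a large value flattens them towards the cluster priors. Unlike $1/\text{distance}(x, m_y)$, the values returned here lie in $[0, 1]$ and sum to 1 over the labels.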
