Solved – How to represent the probability of a point belonging to a cluster

clustering, data visualization, scatterplot

I want to make a scatter plot of a two-dimensional dataset. Suppose I have only 3 clusters. Then I could assign each cluster one of these colors: red, green and blue. With soft assignment, each data point has a certain probability of belonging to each cluster. One can show that visually by plotting each point in the scatter plot with an RGB value of $[p_1,p_2,p_3]$, where $p_i$ is the probability that the point belongs to cluster $i$.
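For concreteness, here is a minimal base R sketch of the RGB idea. The toy data and the membership rule (normalised inverse squared distance to k-means centres) are assumptions made only to get three probabilities per point; any soft clustering output could be substituted for `p`.

```r
# Sketch: colour each point by its three membership probabilities (base R only).
set.seed(1)  # assumption: fixed seed so the figure is reproducible

# Toy 2D data with three loose groups (illustrative, not the asker's data)
x <- c(rnorm(50, 10, 3), rnorm(50, 20, 3), rnorm(50, 30, 3))
y <- c(rnorm(50, 10, 3), rnorm(50, 20, 3), rnorm(50, 30, 3))
pts <- cbind(x, y)

# Stand-in soft assignment: inverse squared distance to k-means centres,
# normalised so that each row sums to 1
centres <- kmeans(pts, centers = 3)$centers
d2 <- sapply(1:3, function(k) rowSums(sweep(pts, 2, centres[k, ])^2))
inv <- 1 / pmax(d2, 1e-12)  # guard against a point sitting exactly on a centre
p <- inv / rowSums(inv)

# One colour per point: red = p1, green = p2, blue = p3
point_cols <- rgb(p[, 1], p[, 2], p[, 3])
plot(x, y, col = point_cols, pch = 19, xlab = "x", ylab = "y")
```

Points that clearly belong to one cluster come out close to pure red, green or blue, while ambiguous points take a mixed colour.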

This works for 2 or 3 classes. But what if I had more than 3? Is there a way to represent these probabilities in an intuitive way, preserving the position of each sample in the 2D space? I'm using R to do the plots, if that gives any useful information.

Best Answer

In general, this is a challenging problem, especially given the constraint that the relative positions in 2D space should be retained.

In the absence of that constraint, I would recommend a stacked bar plot. With thin bars and a sorted dataset, colours can easily be used to indicate the probability of belonging to different clusters for a fairly substantial number of points. Plots such as these are common in population genetics and can convey a fair amount of useful information, such as in this example.
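A rough sketch of such a stacked bar plot in base R, using 4 clusters to show that it is not limited to 3. The toy data and the membership matrix `memb` are stand-ins (normalised inverse squared distance to k-means centres), not the output of any particular soft clustering method.

```r
# Sketch: one thin stacked bar per point, one colour per cluster (base R only).
set.seed(1)  # assumption: fixed seed for reproducibility

# Toy 2D data (illustrative only)
pts <- cbind(c(rnorm(50, 10, 3), rnorm(50, 20, 3), rnorm(50, 30, 3)),
             c(rnorm(50, 10, 3), rnorm(50, 20, 3), rnorm(50, 30, 3)))

# Stand-in membership probabilities for k clusters; each row sums to 1
k <- 4
centres <- kmeans(pts, centers = k)$centers
d2 <- sapply(seq_len(k), function(j) rowSums(sweep(pts, 2, centres[j, ])^2))
inv <- 1 / pmax(d2, 1e-12)
memb <- inv / rowSums(inv)

# Sort points by most likely cluster, then by how strong that membership is
hard <- max.col(memb, ties.method = "first")
ord <- order(hard, -memb[cbind(seq_len(nrow(memb)), hard)])

# barplot() stacks the rows of its matrix argument, so clusters go in rows
barplot(t(memb[ord, ]), col = rainbow(k), border = NA, space = 0,
        xlab = "Points (sorted by most likely cluster)",
        ylab = "Membership probability")
```

Sorting is what makes the plot readable: points with the same dominant cluster sit next to each other, so uncertain points show up as bars with several visible colours.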

If we are to stick with the constraint of retaining relative positions in 2 dimensions, I can think of one solution that would work for modest-sized datasets with a small number of clusters. For these cases, you can plot each point as a small pie; the segments of the pie denote the probability of belonging to each cluster.

Here is a worked example using 3 clusters:

# Loading required libraries
library(e1071)
library(ggplot2)
library(scatterpie)

# Generating data frame (seed fixed so the example is reproducible)
set.seed(42)
dat <- data.frame(a = c(rnorm(50, mean = 10, sd = 3), 
                        rnorm(50, mean = 20, sd = 3),
                        rnorm(50, mean = 30, sd = 3)),
                  b = c(rnorm(50, mean = 10, sd = 5), 
                        rnorm(50, mean = 20, sd = 3),
                        rnorm(50, mean = 30, sd = 3)))

# Identifying clusters and calculating cluster probabilities using 
#  fuzzy c-means clustering
clustdat <- cmeans(dat, centers = 3)

# Adding cluster information to dataset
dat$clusters <- as.factor(clustdat$cluster)
dat$A <- clustdat$membership[,1]
dat$B <- clustdat$membership[,2]
dat$C <- clustdat$membership[,3]

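# Plotting: one pie per point, with slices given by columns A, B and C.
# coord_equal() keeps the pies circular.
ggplot() + geom_scatterpie(aes(a, b, group = clusters), 
                           data = dat, cols = LETTERS[1:3]) +
  coord_equal()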

Note that this may be useful with more than 2 dimensions as well, by combining it with some sort of dimension reduction technique. The reduction is only needed for plotting; the clustering itself can be done in the full multidimensional space.
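As a sketch of that idea, the snippet below clusters toy 5-dimensional data and then uses PCA (`prcomp`) purely to get 2D plotting coordinates. PCA is just one possible choice of reduction, and the membership rule (normalised inverse squared distance to k-means centres) is a stand-in chosen to keep the example in base R.

```r
# Sketch: cluster in the full space, reduce to 2D only for plotting.
set.seed(1)  # assumption: fixed seed for reproducibility

# Toy 5-dimensional data with three groups (illustrative only)
n <- 50
X <- rbind(matrix(rnorm(n * 5, mean = 10, sd = 3), ncol = 5),
           matrix(rnorm(n * 5, mean = 20, sd = 3), ncol = 5),
           matrix(rnorm(n * 5, mean = 30, sd = 3), ncol = 5))

# Stand-in membership probabilities computed in 5D; each row sums to 1
centres <- kmeans(X, centers = 3)$centers
d2 <- sapply(1:3, function(j) rowSums(sweep(X, 2, centres[j, ])^2))
inv <- 1 / pmax(d2, 1e-12)
memb <- inv / rowSums(inv)

# First two principal components serve as plotting coordinates
pc <- prcomp(X)$x[, 1:2]

# Same layout as the data frame in the worked example: x, y, then one
# membership column per cluster
plotdat <- data.frame(PC1 = pc[, 1], PC2 = pc[, 2],
                      A = memb[, 1], B = memb[, 2], C = memb[, 3])
```

`plotdat` could then be passed to `geom_scatterpie()` with `aes(PC1, PC2)` and `cols = c("A", "B", "C")`, in the same way as `dat` in the worked example.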
