I want to make a scatter plot of a two-dimensional dataset. Suppose I have only 3 clusters. Then I could assign each cluster one of these colors: red, green and blue. With soft assignment, each data point has a certain probability of belonging to each cluster. One can show this visually by plotting each point in the scatter plot with an RGB value of $[p_1,p_2,p_3]$, where $p_i$ is the probability that the point belongs to cluster $i$.
This works for 2 or 3 classes. But what if I had more than 3? Is there a way to represent these probabilities in an intuitive way, preserving the position of each sample in the 2D space? I'm using R to do the plots, if that gives any useful information.
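For reference, the three-cluster RGB encoding described above could be sketched in base R as follows. The data are simulated, and the soft assignments come from a softmax over squared distances to assumed cluster centres; both are illustrative assumptions rather than part of the original question.

```r
set.seed(1)
n <- 150

# Assumed example data: 2D points scattered around three centres
centres <- matrix(c(0, 0, 3, 0, 1.5, 2.5), ncol = 2, byrow = TRUE)
xy <- centres[sample(1:3, n, replace = TRUE), ] + matrix(rnorm(2 * n), ncol = 2)

# Assumed soft assignment: softmax of negative half squared distances
d2 <- sapply(1:3, function(k) (xy[, 1] - centres[k, 1])^2 + (xy[, 2] - centres[k, 2])^2)
p <- exp(-d2 / 2)
p <- p / rowSums(p)  # each row now sums to 1

# Map (p1, p2, p3) directly onto the (red, green, blue) channels
cols <- rgb(p[, 1], p[, 2], p[, 3])
plot(xy, col = cols, pch = 19, xlab = "x", ylab = "y")
```

Points that sit firmly in one cluster come out as nearly pure red, green or blue, while ambiguous points take on mixed colors.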
Best Answer
In general, this is a challenging problem, especially given the constraint that the relative positions in 2D space should be retained.
In the absence of that constraint, I would recommend a stacked bar plot. With thin bars and a sorted dataset, colours can easily be used to indicate the probability of belonging to different clusters for a fairly substantial number of points. Plots such as these are common in population genetics and can convey a fair amount of useful information, such as in this example.
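A minimal sketch of such a stacked bar plot in base R, assuming a simulated membership matrix with five clusters (the data and the sorting rule are illustrative choices, not taken from any particular study):

```r
set.seed(2)
n <- 60  # number of data points
k <- 5   # number of clusters

# Assumed example data: random memberships with one dominant cluster per point
raw <- matrix(rexp(n * k), nrow = n)
idx <- cbind(seq_len(n), sample(k, n, replace = TRUE))
raw[idx] <- raw[idx] + 5
p <- raw / rowSums(raw)  # each row sums to 1

# Sort points by most probable cluster, then by decreasing probability of it
best <- max.col(p, ties.method = "first")
ord <- order(best, -p[cbind(seq_len(n), best)])

# One thin bar per point; stacked segments are the cluster probabilities
barplot(t(p[ord, ]), col = rainbow(k), border = NA, space = 0,
        xlab = "Data point (sorted)", ylab = "Membership probability")
```

Sorting is what makes the plot readable: points with the same dominant cluster end up adjacent, so each color forms a contiguous block.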
If we are to stick with the constraint of retaining relative positions in 2 dimensions, I can think of one solution that would work for modest-sized datasets with a small number of clusters. For these cases, you can plot each point as a small pie; the segments of the pie denote the probability of belonging to each cluster.
Here is a worked example using 3 clusters:
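A minimal base-R sketch of the pie-per-point idea with three clusters. The data are simulated, the soft assignments are an assumed softmax over distances to fixed centres, and `draw_pie` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Hypothetical helper: draw one small pie at (x, y) from a probability vector
draw_pie <- function(x, y, probs, radius, cols, n_seg = 60) {
  ends <- 2 * pi * cumsum(probs) / sum(probs)  # wedge end angles
  starts <- c(0, head(ends, -1))               # wedge start angles
  for (i in seq_along(probs)) {
    if (probs[i] <= 0) next
    theta <- seq(starts[i], ends[i],
                 length.out = max(2, ceiling(n_seg * probs[i])))
    # Wedge = centre point plus an arc along the circle
    polygon(c(x, x + radius * cos(theta)),
            c(y, y + radius * sin(theta)),
            col = cols[i], border = NA)
  }
}

set.seed(3)
n <- 40

# Assumed example data: 2D points scattered around three centres
centres <- matrix(c(0, 0, 4, 0, 2, 3.5), ncol = 2, byrow = TRUE)
xy <- centres[sample(1:3, n, replace = TRUE), ] + matrix(rnorm(2 * n), ncol = 2)

# Assumed soft assignment: softmax of negative half squared distances
d2 <- sapply(1:3, function(k) (xy[, 1] - centres[k, 1])^2 + (xy[, 2] - centres[k, 2])^2)
p <- exp(-d2 / 2)
p <- p / rowSums(p)

cols <- c("red", "green3", "blue")
# asp = 1 keeps the pies circular; type = "n" sets up axes without drawing points
plot(xy, type = "n", asp = 1, xlab = "x", ylab = "y")
for (j in seq_len(n)) {
  draw_pie(xy[j, 1], xy[j, 2], p[j, ], radius = 0.15, cols = cols)
}
legend("topright", legend = paste("Cluster", 1:3), fill = cols, bty = "n")
```

The same helper works for more than three clusters by passing a longer `cols` vector, although the pies become hard to read once the points overlap or the number of clusters grows.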
Note that this may also be useful with more than 2 dimensions if it is combined with some sort of dimension reduction technique. The reduction is needed only for plotting; the clustering itself can still be done in the full multidimensional space.