Solved – How to represent the probability of a point belonging to a cluster

clustering, data visualization, scatterplot

I want to make a scatter plot of a two-dimensional dataset. Suppose I have only 3 clusters. Then I could assign each cluster one of these colors: red, green, or blue. If soft assignment is used, each data point has a certain probability of belonging to each cluster. This can be shown visually by plotting each point in the scatter plot with the RGB value $[p_1,p_2,p_3]$, where $p_i$ is the probability that the point belongs to cluster $i$.
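The RGB encoding described above can be sketched in base R as follows. The membership matrix `p` and the coordinates in `xy` are made-up values for illustration only; in practice they would come from the soft clustering and the dataset.

```r
# Hypothetical soft assignments for five points and three clusters.
# Each row sums to 1; the values are invented for illustration.
p <- matrix(c(0.90, 0.05, 0.05,
              0.10, 0.80, 0.10,
              0.05, 0.05, 0.90,
              0.50, 0.50, 0.00,
              0.34, 0.33, 0.33),
            ncol = 3, byrow = TRUE)

# Hypothetical 2D positions of the same five points
xy <- data.frame(x = c(1, 2, 3, 1.5, 2.0),
                 y = c(1, 2, 1, 1.5, 1.3))

# rgb() maps red/green/blue intensities in [0, 1] to a hex colour string,
# giving one colour per point
point_cols <- rgb(p[, 1], p[, 2], p[, 3])

plot(xy$x, xy$y, col = point_cols, pch = 19, cex = 2,
     xlab = "x", ylab = "y")
```

Points that clearly belong to one cluster come out as a nearly pure red, green, or blue, while ambiguous points come out as darker mixed colours.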

This works for 2 or 3 classes. But what if I had more than 3? Is there a way to represent these probabilities in an intuitive way, preserving the position of each sample in the 2D space? I'm using R to do the plots, if that gives any useful information.

Best Answer

In general, this is a challenging problem, especially given the constraint that the relative positions in 2D space should be retained.

In the absence of that constraint, I would recommend a stacked bar plot. With thin bars and a sorted dataset, colours can easily be used to indicate the probability of belonging to different clusters for a fairly substantial number of points. Plots such as these are common in population genetics and can convey a fair amount of useful information, such as in this example.
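A minimal base R sketch of such a stacked bar plot is given here. The membership matrix is simulated purely for illustration (60 points, 5 clusters), and the sorting rule (by most likely cluster, then by strength of membership) is just one reasonable choice.

```r
set.seed(1)

# Simulated membership matrix: 60 points, 5 clusters, rows sum to 1.
# In practice this would come from the soft clustering itself.
n <- 60
k <- 5
raw <- matrix(rexp(n * k), nrow = n)
# Boost one randomly chosen cluster per point so that most points
# have a clearly dominant cluster
raw[cbind(seq_len(n), sample(k, n, replace = TRUE))] <- 10
membership <- raw / rowSums(raw)

# Sort points by most likely cluster, then by how strong that membership is
top <- max.col(membership, ties.method = "first")
ord <- order(top, -apply(membership, 1, max))

# barplot() stacks the rows of a matrix within each bar, so transpose:
# one thin bar per point, one coloured segment per cluster
barplot(t(membership[ord, ]), col = rainbow(k), border = NA, space = 0,
        xlab = "Points (sorted by most likely cluster)",
        ylab = "Membership probability")
```

Because each bar has total height 1, the number of clusters is limited only by how many colours can be told apart, not by the three channels of an RGB value.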

If we are to stick with the constraint of retaining relative positions in 2 dimensions, I can think of one solution that would work for modest-sized datasets with a small number of clusters. For these cases, you can plot each point as a small pie; the segments of the pie denote the probability of belonging to each cluster.

Here is a worked example using 3 clusters:

# Loading required libraries
library(e1071)
library(ggplot2)
library(scatterpie)

# Generating data frame (seed set so the example is reproducible)
set.seed(1)
dat <- data.frame(a = c(rnorm(50, mean = 10, sd = 3), 
                        rnorm(50, mean = 20, sd = 3),
                        rnorm(50, mean = 30, sd = 3)),
                  b = c(rnorm(50, mean = 10, sd = 5), 
                        rnorm(50, mean = 20, sd = 3),
                        rnorm(50, mean = 30, sd = 3)))

# Identifying clusters and calculating cluster probabilities using 
#  fuzzy c-means clustering
clustdat <- cmeans(dat, centers = 3)

# Adding cluster information to dataset
dat$clusters <- as.factor(clustdat$cluster)
dat$A <- clustdat$membership[,1]
dat$B <- clustdat$membership[,2]
dat$C <- clustdat$membership[,3]

# Plotting (coord_equal() keeps the pies circular rather than stretched)
ggplot() + geom_scatterpie(aes(a, b, group = clusters), 
                           data = dat, cols = LETTERS[1:3]) +
  coord_equal()

Note that this approach can also be useful with more than 2 dimensions if it is combined with a dimension reduction technique: the clustering can be done in the full multidimensional space, with the reduction used only for plotting.
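As a rough sketch of that idea, the example below clusters in four dimensions and uses the first two principal components only to position the pies. The choice of the built-in `iris` measurements and of PCA via `prcomp()` are assumptions made for illustration; any other dataset or dimension reduction method could be substituted.

```r
library(e1071)
library(ggplot2)
library(scatterpie)

set.seed(1)

# Four-dimensional example data (built-in iris measurements), standardised
X <- scale(iris[, 1:4])

# Fuzzy c-means clustering in the full four-dimensional space
fit <- cmeans(X, centers = 3)

# Principal components, used for plotting positions only
pcs <- prcomp(X)$x[, 1:2]

plotdat <- data.frame(PC1 = pcs[, 1],
                      PC2 = pcs[, 2],
                      clusters = as.factor(fit$cluster),
                      A = fit$membership[, 1],
                      B = fit$membership[, 2],
                      C = fit$membership[, 3])

# One pie per point, placed at its position in the reduced space
ggplot() +
  geom_scatterpie(aes(x = PC1, y = PC2, group = clusters),
                  data = plotdat, cols = c("A", "B", "C")) +
  coord_equal()
```

The pie segments still reflect memberships computed in the original space, so two points that sit close together in the plot can show quite different pies if they are separated along a dimension that the projection discards.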

Related Solutions

R – Alternative to Sieve and Mosaic Plots for Contingency Tables

The book you describe sounds like 'Visualizing Categorical Data' by Michael Friendly. The plot in the 1st chapter that seems to match your request is presented as a type of conceptual model for visualizing contingency table data (loosely described by the author as a dynamic pressure model with observational density), and can be seen in the Google preview for Ch 1. The book is geared towards SAS users.

A paper on the topic is referenced here: www.datavis.ca/papers/koln/kolnpapr.pdf

'Conceptual Models for Visualizing Contingency Table Data,' Michael Friendly.

* Incidentally, the author is also listed as one of the authors of the vcd package (which was specifically inspired by the book mentioned above). You could ask him directly whether there is a simple modification to one of the built-in functions that is not readily apparent.

** The coloring scheme appears to use blue for positive deviations from independence and red for negative deviations. Although red makes sense in that context, green might have been a more apt choice for positive deviations.

http://www.datavis.ca/papers/asa92.html

Solved – How to visualize cluster data in a scatter way

Here's an example with the auto data that uses two rings whose areas are proportional to standard deviations, which is not quite what you want, but is fairly easy:

sysuse auto, clear
collapse (mean) price mpg (sd) sd_price = price sd_mpg = mpg, by(rep78)

tw (scatter price mpg [w=sd_price], ms(Oh)) (scatter price mpg [w=sd_mpg], ms(Oh)) (scatter price mpg, msymbol(none) mlabpos(0) mlabel(rep78)), legend(off)

The missing group corresponds to "."

This way of plotting the data does not seem like a good idea, as it obscures some features of the data. For instance, you get the sense that the SD of price is larger than the SD of MPG, but for group 1 the former is 200 times the latter even though the two bubbles appear to be the same size.
