Data Visualization – Best Plot in R for Count Data with Broad Range and Low Frequency

categorical data, data visualization, frequency, ggplot2, r

I have a data set that looks like this. In reality there are about twenty categories with a count of 1:

Things Count
Cars 500
Trucks 250
Jeeps 17
Planes 2
Foot 1
Oranges 1
Plasters 1
Grapes 1
Tablets 1
Pillows 1

I have tried plotting:

Pie charts: too many low-frequency entries. Adding a second, zoomed-in pie for the low counts looks silly because they all have a count of one.
Bar charts: not a nice aesthetic, with lots of counts of 1 in the tail.
Treemaps (using treemapify): again too many low-frequency tiles, so many of them don't have their name within them.

I would like to know how people visualize such data in a presentable way. Online searches don't yield many helpful responses. Maybe I'll just stick with a table?

This is a toy example. In reality I am trying to visualize patients who have been given a genetic diagnosis for a set disease, so the "things" are genes, each with its count. Many people have a rare gene, hence the low counts, but a few genes make up a larger proportion of the diagnoses. I want to tell this story in the most visually appealing way possible. I have been on the R Gallery, but none of the plots really seemed to deal with count data with such a broad range.

Best Answer

After playing with this for a bit I think 'just use a table' might be the best approach, but I tried a few of R's methods for showing this kind of information, so I'm including them as an answer in case they're useful.

Here's the data in R (using the data.table package):

library(data.table)
dat <- fread(text="Cars 500
Trucks 250
Jeeps 17
Planes 2
Foot 1
Oranges 1
Plasters 1
Grapes 1
Tablets 1
Pillows 1", header=FALSE, col.names=c("Category", "Frequency"))

Edit: you can add labels directly onto a waffle plot by extracting the locations of the plotted tiles. So:

# create a waffle plot
library(ggplot2)
library(waffle)
library(data.table)
w1 <- ggplot(dat[sample(nrow(dat)),], aes(fill=Category, values=Frequency)) + 
  geom_waffle(n_rows=50, col=NA) + 
  theme_void() + 
  scale_fill_brewer(palette="Paired") + 
  theme(legend.position = "none")

# extract the locations of the plotted tiles
# (the built data is a list with one data frame per layer; take the first)
d1 <- as.data.table(ggplot_build(w1)$data[[1]])

# take the average position for each group
# (relies on each group having a different fill colour)
regiondata <- d1[, .(x=mean(x), y=mean(y)), by=fill]

# now draw the plot, and add the labels 
w1 + geom_text(aes(
  label = Category,
  x = regiondata$x,
  y = regiondata$y,
  size = Frequency
)) +
  scale_size_binned()

You can play with the size scale or maybe with ggrepel to get better sizing and placement of the labels.

Be careful that the labels are plotted in the right order. This relies on the order of the rows in the ggplot data being the same as the order in which the waffle plot is built. So far this seems to be the case, but it's worth checking.
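One way to run that check is sketched below in base R. The helper name check_label_order is hypothetical, and it assumes the built layer data has one row per tile with a fill column (which the averaging step above already relies on). It compares the number of tiles per fill colour, in order of first appearance, with the frequencies of the plotted rows. This is only a necessary condition: categories that share a count (all the 1s) cannot be told apart this way, so those still need a check by eye.

```r
# Hypothetical helper: TRUE if the i-th fill colour to appear in the tile data
# has as many tiles as the i-th row of the plotted data has Frequency.
check_label_order <- function(tiles, plotted) {
  fills <- unique(tiles$fill)  # first-appearance order, as by=fill gives
  tile_counts <- as.vector(table(factor(tiles$fill, levels = fills)))
  length(tile_counts) == nrow(plotted) &&
    all(tile_counts == plotted$Frequency)
}

# Synthetic stand-ins for the built tile data and the plotted data
tiles_demo <- data.frame(fill = rep(c("#A", "#B", "#C"), times = c(5, 2, 1)))
plotted_ok <- data.frame(Category = c("Cars", "Planes", "Foot"),
                         Frequency = c(5, 2, 1))
plotted_bad <- plotted_ok[3:1, ]

check_label_order(tiles_demo, plotted_ok)   # TRUE
check_label_order(tiles_demo, plotted_bad)  # FALSE
```

With the objects from the code above, the call would presumably be check_label_order(d1, w1$data), since the shuffled data frame is stored in the plot object.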

First I tried the waffle and treemapify (as you did) R packages:

library(ggplot2)
library(waffle)
library(treemapify)
ggplot(dat, aes(fill=Category, values=Frequency)) + 
  geom_waffle(n_rows=50, col=NA) + theme_void()

ggplot(dat, aes(area = Frequency, fill = Category, label = Category)) +
  geom_treemap(layout="srow") +
  geom_treemap_text(layout="srow")+ 
  scale_fill_brewer(palette="Paired")

Neither of these works well when the distribution is so skewed, but they do convey the sense of area pretty well. If you can figure out how to label the waffle plot without using a legend it might work well.

I looked at facet_zoom from ggforce but I don't think it works in this case. So I tried to re-implement it as a pair of stacked bar charts with varying widths. Again, I'm not sure it conveys your message particularly well, but it's possible you could adapt it.

This works by duplicating the dataset, with one part only having the low frequencies, then plotting two stacked bars side by side with some annotation to indicate the zooming and variable widths to make the relative areas correct:

dat[, prop:=Frequency/sum(Frequency)]
datlong=rbindlist(list(full=dat,zoomed=dat[Frequency<100]), idcol="part")
datlong[, sum:=sum(Frequency), by=part]
datlong[, grandsum:=sum(Frequency)]

ggplot(datlong) + aes(x=part, y=Frequency, fill=reorder(Category, Frequency)) + 
  # bar width is mapped per part so that the relative areas are correct
  geom_col(aes(width=sum/grandsum), position="fill", col="black") + 
  geom_text(aes(label=sprintf("%s (%d)",Category,Frequency)), 
            position = position_fill(vjust=0.5),
            data=datlong[(part=="full" & Frequency>100)]) + 
  geom_text(aes(label=sprintf("%s (%d)",Category,Frequency)), 
            position = position_fill(vjust=0.5),hjust=-0.1,
            data=datlong[(part=="zoomed")]) + 
  theme_minimal()+
  theme(legend.position = "none",
        axis.text = element_blank()) + 
  annotate(x=1.5,xend=1.99, 
           y=1-sum(datlong[part=="zoomed", prop]),
           yend=0, geom="segment", lty="dashed")+
  annotate(x=1.5,xend=2.0, 
           y=1,yend=1, geom="segment", lty="dashed")+ 
  labs(x=NULL, y=NULL)

But I don't honestly think it's better than a table or a description in text would be.
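For the table route, here is a minimal sketch in base R of one possible layout. Collapsing all the count-of-1 categories into a single "Other" row is an assumption on my part about what reads well, not the only option, and the names toy and tab are just illustrative.

```r
# Toy data from the question
toy <- data.frame(
  Category  = c("Cars", "Trucks", "Jeeps", "Planes", "Foot", "Oranges",
                "Plasters", "Grapes", "Tablets", "Pillows"),
  Frequency = c(500, 250, 17, 2, 1, 1, 1, 1, 1, 1)
)

# Keep categories with more than one observation, and collapse the
# singletons into one summary row (an assumed presentation choice)
rare <- toy$Frequency == 1
tab <- rbind(
  toy[!rare, ],
  data.frame(Category  = sprintf("Other (%d categories, 1 each)", sum(rare)),
             Frequency = sum(toy$Frequency[rare]))
)

# Add each row's share of the total as a percentage
tab$Percent <- round(100 * tab$Frequency / sum(tab$Frequency), 1)

print(tab, row.names = FALSE)
```

The names of the collapsed categories could then be listed in a footnote or in the text.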
