Solved – Best way to visualize data with two keys and many rows in R (heatmap, mosaic plot, treemap, ggplot)

Tags: data visualization, ggplot2, heatmap, r

I have been wrestling with the problem of creating some type of visualization for my data.

I have two keys (used in data table to create the data) that are intervals. I constructed the intervals from what used to be continuous variables in my data set. It is important for me to be able to construct these manually because I do not want the same breaks in the buckets. This has been completed to my satisfaction.

Specifics about the data: The data is in the form of a data frame with rownames that are 187 unique intervals. (I also have a data table that has these values in separate columns if that would help in visualization.) I have 4 categories which have corresponding counts for how many observations fall within the interval.

For example, it looks like this:

 (interval1_1, interval2_1) :   number_category1 ... number_category4
 .
 .
 .
 (interval1_187, interval2_187): number_category1 ... number_category4

I have attempted to create a heat map; however, there are too many rows for the interval labels to be readable. I have also looked into mosaic plots and treemaps. I would be happy to restructure the data in a way that is better suited to visualization.

Right now, the best I can do is a visualization using ggplot.

 ggplot(df, aes(x = interval_1, y = interval_2)) +
   geom_jitter(aes(size = N, col = four_types),
               position = position_jitter(.2, .2)) +
   scale_size(range = c(2, 9))

The code above plots points whose size corresponds to the count for each interval pair, with interval_1 on the x-axis and interval_2 on the y-axis. Not every interval pair contains all of the four_types, but in effect each pair gets up to four points, colored by type and slightly jittered around its position.

Besides asking whether anyone can think of a better way to visualize this data, I have a second problem: I have not been able to get as many distinct point sizes as I want in the plot produced by the ggplot code above. Right now, it uses only four different point sizes. When I try scale_size_continuous, the first three or four sizes are not different enough from each other, so in effect I still get four sizes even when I supply numbers for the breaks. I have played around with the range argument in scale_size and scale_size_continuous, but I haven't gotten anything to work.

Thanks!

Best Answer

With that much data (187 x 187 x 4 categories), I think the main option is four heat maps or scatterplots, one per category, using color for the count (or the log count if the counts are skewed). Here's a heat map sized so each square is 3x3 pixels.

[Image: heat map of counts, each square 3x3 pixels]
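
As a rough sketch of the heat-map idea, the code below draws one panel per category with geom_tile and facet_wrap, using log10(N + 1) for the fill. The column names interval_1, interval_2, four_types and N are taken from the ggplot call in the question; the data itself is simulated on a 20 x 20 grid and is only an assumption for illustration, as are the colors.

```r
library(ggplot2)

# Hypothetical long-format data: one row per (interval_1, interval_2, four_types)
# combination with a count N. Values are simulated purely for illustration.
set.seed(1)
df <- expand.grid(
  interval_1 = factor(1:20),
  interval_2 = factor(1:20),
  four_types = paste0("category", 1:4)
)
df$N <- rpois(nrow(df), lambda = exp(rnorm(nrow(df), mean = 2)))

# One heat-map panel per category. The fill encodes log10(N + 1) so that a
# few very large counts do not wash out the rest of the color scale.
p <- ggplot(df, aes(x = interval_1, y = interval_2, fill = log10(N + 1))) +
  geom_tile() +
  facet_wrap(~ four_types, ncol = 2) +
  scale_fill_gradient(low = "white", high = "darkred", name = "log10(N + 1)") +
  # With many intervals the axis labels are unreadable anyway, so drop them
  theme(axis.text = element_blank(), axis.ticks = element_blank())

print(p)
```

With real data, df would be the long-format table from the question rather than the simulated one, and the + 1 inside the log is only needed if some counts are zero.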

Another option is two levels of graphs:

  1. A coarser view with fewer interval groups (10x10 instead of 187x187)
  2. Zoomed-in views of those blocks that are of interest or are selected on demand

Here's an example of the coarser view, which allows all four category counts to be summarized for that block. I've used a line plus background color for the summary, but it could be a treemap or other view instead.

[Image: coarser view, with a line plus background color summarizing the four category counts per block]
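
The two-level approach could be prepared along these lines. This is a minimal sketch that assumes the original continuous variables are still available; the names x1, x2, block_1 and block_2 are hypothetical, and the data is simulated. Equal-width breaks are used only to keep the example short.

```r
# Hypothetical data: the continuous variables x1 and x2 (before they were
# bucketed into intervals), a category label, and a count N.
set.seed(2)
n <- 2000
dat <- data.frame(
  x1 = runif(n, 0, 100),
  x2 = runif(n, 0, 100),
  four_types = sample(paste0("category", 1:4), n, replace = TRUE),
  N = rpois(n, 5)
)

# Level 1: a coarse 10 x 10 grid. Hand-picked break vectors can be passed
# to cut() in exactly the same way as these equal-width ones.
coarse_breaks <- seq(0, 100, by = 10)
dat$block_1 <- cut(dat$x1, breaks = coarse_breaks, include.lowest = TRUE)
dat$block_2 <- cut(dat$x2, breaks = coarse_breaks, include.lowest = TRUE)

# Total count per coarse block and category
coarse <- aggregate(N ~ block_1 + block_2 + four_types, data = dat, FUN = sum)

# Level 2: zoom into one block of interest by keeping only its rows
# (here simply the first block on each axis)
zoom <- subset(dat,
               block_1 == levels(block_1)[1] & block_2 == levels(block_2)[1])
```

The coarse table has at most 10 x 10 x 4 rows, so it can be drawn as a heat map, a treemap, or small per-block summaries, while zoom can be plotted at the original resolution.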