Solved – Visualizing multiple “histograms” (bar-charts)

I am having difficulty choosing the right way to visualize my data. Let's say we have bookstores that sell books, and every book has at least one category.

For a bookstore, if we count the categories of all its books, we get a histogram that shows the number of books that fall into each category for that bookstore.
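To make the counting step concrete, here is a minimal sketch in R. The long-format table `books` and its column names are hypothetical (one row per bookstore, book and category tag), not my real data:

```r
# Hypothetical long-format data: one row per (store, book, category) tag.
# A book with two categories appears on two rows.
books <- data.frame(
  store    = c("A", "A", "A", "B", "B", "B", "B"),
  book     = c("b1", "b1", "b2", "b3", "b4", "b5", "b5"),
  category = c("Sci-Fi", "Romance", "Sci-Fi",
               "History", "History", "Romance", "History")
)

# Cross-tabulate: rows are bookstores, columns are categories
counts <- table(books$store, books$category)
print(counts)

# The "histogram" for a single bookstore is one row of that table
barplot(counts["A", ], main = "Bookstore A", ylab = "Number of books")
```

Each row of `counts` is then one of the histograms discussed below.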

I want to visualize bookstore behavior: do they favor some categories over others? I am not interested in whether they favor sci-fi in particular, but in whether they treat every category equally or not.

I have ~1M bookstores.

I have thought of 4 methods:

  1. Sample the data and show the histograms of only 500 bookstores, on 5 separate pages, each using a 10×10 grid. Example of a 4×4 grid:

    multiple histograms 1

  2. Same as #1, but sort the x-axis values by count in descending order, so that any favoring is easy to see.

  3. Imagine putting the histograms in #2 together like a deck and showing them in 3D. Something like this:
    3D histogram

  4. Instead of using a third axis, use color to represent the counts, i.e. a heatmap (2D histogram):
    2D histogram
    If bookstores generally prefer some categories to others, this will show up as a clear gradient from left to right.
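Methods 2 and 4 combined could be sketched roughly as follows in base R. This is only an illustration on simulated data: the Poisson counts, the sizes `n_shops` and `n_cat`, and the conversion of counts to per-shop shares are all assumptions made for the sketch.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible

n_shops <- 500  # size of the sample of bookstores
n_cat   <- 10   # number of categories (hypothetical)

# Hypothetical data: one row per bookstore, one column per category
counts <- matrix(rpois(n_shops * n_cat, lambda = 50), nrow = n_shops)

# Method 2: sort each bookstore's counts in descending order.
# apply() returns one column per bookstore, so transpose back.
sorted <- t(apply(counts, 1, sort, decreasing = TRUE))

# Assumption: use shares instead of raw counts so that large and
# small bookstores are comparable
shares <- sorted / rowSums(sorted)

# Method 4: heatmap with one row per bookstore.
# image() expects z to have length(x) rows and length(y) columns.
image(x = 1:n_cat, y = 1:n_shops, z = t(shares),
      col = heat.colors(20),
      xlab = "Category rank within bookstore", ylab = "Bookstore")
```

Because every row is sorted, the x axis is the rank of a category within a bookstore rather than a fixed category, so the picture shows how unevenly each bookstore spreads its books, not which category it favors.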

Do you have any other visualization ideas/tools to represent multiple histograms?

Best Answer

As you have found out, there are no easy answers to your question!

I presume that you are interested in finding strange or different bookstores? If so, you could try something like PCA (see the Wikipedia cluster analysis page for more details).

To give you an idea, consider this example. You have 26 bookshops (named A, B, ..., Z). All bookshops are similar, except:

  1. Shop Z sells only a few History books.
  2. Shops O-Y sell more romance books than average.

A principal components plot highlights these shops for further investigation.

Here's some sample R code:

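# Simulate category counts for 26 shops
d <- data.frame(Romance = rpois(26, 50), Horror = rpois(26, 100),
                Science = rpois(26, 75), History = rpois(26, 125))
rownames(d) <- LETTERS
# Alter a few shops: O-Y sell more Romance, Z sells few History books
d[15:25, "Romance"] <- rpois(11, 150)
d[26, "History"] <- rpois(1, 10)
# Look at the data (the counts are random, so values differ between runs)
head(d, 2)
#   Romance Horror Science History
# A      36    107      62     139
# B      47     93      64     118
# PCA on the covariance matrix (prcomp centers but does not scale by default)
books.PC.cov <- prcomp(d)
books.scores.cov <- predict(books.PC.cov)
# Plot of PC1 vs PC2: empty plot first, then one letter per shop
plot(books.scores.cov[, 1], books.scores.cov[, 2],
     xlab = "PC 1", ylab = "PC 2", type = "n")
text(books.scores.cov[, 1], books.scores.cov[, 2], labels = LETTERS)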

This gives the following plot:

PCA plot http://img265.imageshack.us/img265/7263/tmplx.jpg

Notice that:

  1. Shop Z is an outlying point.
  2. The other shops form two distinct groups: O-Y, which sell more Romance, and A-N.
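To check which categories are behind this pattern, one option is to look at the loadings of the first two components. The following is a minimal sketch that rebuilds the same kind of simulated data; the fixed seed and the names `pc`, `loadings`, `top_pc1`, `top_pc2` and `far_pc2` are introduced here purely for illustration.

```r
set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible

# Same simulated setup as in the example: 26 shops, 4 categories
d <- data.frame(Romance = rpois(26, 50), Horror = rpois(26, 100),
                Science = rpois(26, 75), History = rpois(26, 125))
rownames(d) <- LETTERS
d[15:25, "Romance"] <- rpois(11, 150)  # shops O-Y: more Romance
d[26, "History"]    <- rpois(1, 10)    # shop Z: few History books

pc <- prcomp(d)          # centered, unscaled PCA (prcomp defaults)
loadings <- pc$rotation  # rows = categories, columns = components

# Category with the largest absolute weight on each of the first two components
top_pc1 <- rownames(loadings)[which.max(abs(loadings[, "PC1"]))]
top_pc2 <- rownames(loadings)[which.max(abs(loadings[, "PC2"]))]

# Shop lying furthest from the origin along PC 2
far_pc2 <- rownames(pc$x)[which.max(abs(pc$x[, "PC2"]))]

print(round(loadings[, 1:2], 2))
print(summary(pc)$importance[, 1:2])  # share of variance per component
cat("PC1 driven by:", top_pc1, "| PC2 driven by:", top_pc2,
    "| most extreme on PC2:", far_pc2, "\n")
```

With data altered this way, Romance should dominate PC 1 (separating O-Y from the rest) and History should dominate PC 2 (isolating Z). On real bookstore data the loadings will rarely be this clean, and with ~1M shops you would probably plot a sample or a density rather than labelled points.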

Other possibilities

You could also look at GGobi; I've never used it, but it looks interesting.