Solved – Betweenness centrality applied to Amazon books graph

data visualization, graph theory, interactive-visualization, survey

I made a visualization of Amazon related products. Every link in the visualization means that two products are often bought together.

Now I'm applying various graph analysis techniques and am fascinated by the results. The biggest problem, though, is translating scientific terms into human language (please pardon me if that sounds snobbish).

For example, I calculated the in-degree centrality of the nodes and called it "What's popular here". After all, the most connected node is the product most often bought together with the other products in a given graph.
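To make the "What's popular here" idea concrete, here is a minimal sketch of in-degree centrality in plain Python. The edge list is a made-up toy example, not the real Amazon data, and the assumption that an edge points from a product to the product listed alongside it is mine.

```python
from collections import Counter

# Hypothetical toy links, NOT the real Amazon data.
# Assumption: (source, target) means the target shows up as
# "bought together" on the source's product page.
edges = [
    ("R Cookbook", "The Art of R Programming"),
    ("Visualize This", "The Art of R Programming"),
    ("Doing Bayesian Data Analysis", "The Art of R Programming"),
    ("The Art of R Programming", "R Cookbook"),
    ("Visualize This", "R Cookbook"),
    ("R Cookbook", "Doing Bayesian Data Analysis"),
]


def in_degree(edges):
    """Count the incoming edges of every node in a directed edge list."""
    counts = Counter(target for _, target in edges)
    # Make sure nodes that are never a target still appear, with count 0.
    for source, _ in edges:
        counts.setdefault(source, 0)
    return counts


# Nodes sorted from most to fewest incoming edges.
ranking = in_degree(edges).most_common()
for title, count in ranking:
    print(f"{title}: {count} incoming edges")
```

The raw count is enough for ranking nodes within one graph; dividing by the number of other nodes would give a normalized value that can be compared across graphs of different sizes.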

Currently I'm working on betweenness centrality, and it yields quite interesting results, often contradicting degree centrality. But how would you interpret these results? Most important products?

For example, the books graph for The Art of R Programming has the following top 3 nodes:

Indegree Centrality:

  1. The Art of R Programming – 18 incoming edges
  2. R Cookbook (O'Reilly Cookbooks) – 14 incoming edges
  3. Doing Bayesian Data Analysis: A Tutorial with R and BUGS – 10 incoming edges

Betweenness Centrality:

  1. The Art of R Programming – centrality value of 1210
  2. What is a p-value anyway? – centrality value of 896
  3. Visualize This – centrality value of 784

The graph itself looks like this:

[Image: visualization of the book graph]

Best Answer

As for the difference between in-degree centrality and betweenness centrality, or really any other centrality measure, the answer is that you're identifying different things.

"Currently I'm working on betweenness centrality, and it yields quite interesting results, often contradicting degree centrality. But how would you interpret these results? Most important product?"

I wouldn't necessarily call it the most important product. To my mind, a better description might be "core" products: books that are relatively easy to end up at regardless of what you purchase. Looking at your figure, the three most popular nodes are all near the center of your graph. They mark the places where, as soon as you move outside your sub-field, you find a somewhat higher-level book that defines several groups.

Take Visualize This, as it's the clearest illustration of this. Even if people don't jointly buy books about Tufte's theories and infographics about trivia, Visualize This is a common foundational book not very far removed from either group.

The same is true of the p-value book. No one jointly buys an "Idiot's Guide to a Natural Science" book, a "Popular Statistics" book, and a "Biostatistics" book. But readers from all three groups can and do end up buying What Is a p-value anyway? It's a core book, useful to three different groups of readers.
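The "core book between groups" reading can be illustrated with a small sketch. It assumes a made-up toy graph (three invented clusters joined only through one bridging book) and uses a plain-Python version of Brandes' algorithm for unweighted, undirected graphs; none of the cluster names or edges come from the real Amazon data.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm) for an
    unweighted, undirected graph given as {node: set_of_neighbors}."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, starting from the nodes farthest from s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}


def build_toy_graph():
    """Hypothetical graph: three clusters that meet only at one bridge book."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    bridge = "What Is a p-value anyway?"
    for group in ("popular statistics", "biostatistics", "natural science"):
        hub = f"{group} bestseller"  # invented cluster hub
        link(bridge, hub)
        for i in range(1, 5):
            link(hub, f"{group} title {i}")  # invented cluster members
    return adj, bridge


adj, bridge = build_toy_graph()
degree = {v: len(neighbors) for v, neighbors in adj.items()}
scores = betweenness(adj)

for v in sorted(adj, key=scores.get, reverse=True)[:4]:
    print(f"{v}: degree {degree[v]}, betweenness {scores[v]:.0f}")
```

In this toy graph the bridging book has fewer links than any cluster bestseller, yet every shortest path between two different clusters runs through it, so it gets the highest betweenness. That is the sense in which the two rankings can disagree.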
