Solved – Best way to visualize attrition using R

data-visualization, r, sankey-diagram

Through this site I recently discovered Sankey diagrams, a great way to visualize what is happening in a traditional flow chart.

Here is a good example of a Sankey diagram, by George M. Whitesides and George W. Crabtree:
[Figure: Sankey diagram from "Don't Forget Long-Term Fundamental Research in Energy"]
Source: Don't Forget Long-Term Fundamental Research in Energy, Science, 9 February 2007, Vol. 315, no. 5813, pp. 796-798.

After I realized that there was no Sankey R package, I found an R script online. Unfortunately, this script is quite raw and somewhat limited. With high hopes I asked on Stack Overflow for a Sankey R package or a more mature function, but to my surprise it seems that we do not have a mature function for building Sankey diagrams in R.

After I posted a bounty, Geek On Acid was kind enough to suggest a small hack to the existing script, which made it work more or less for my specific purpose.

The improved R script produced this diagram:
[Figure: Geek On Acid's R Sankey diagram]
Source: stackoverflow.com.

But does the lack of an R package indicate that Sankey diagrams aren't such an amazing way to visualize attrition in a data flow like the one presented in the diagram above? (See the initial Stack Overflow question for data and R code.) Maybe there's a better way to visualize attrition.

What do you think is the best way to visualize attrition in a data flow using R?

Best Answer

I agree with @gung. The Sankey diagram you posted is, I think, a pretty good example of where the technique can help. While it is complicated, the context (energy input and output) is complex too and it is hard to think of a nicer way of visualizing the paths of inputs-to-outputs-acting-as-new-inputs across multiple categories of usage.

Now then, for the attrition example you posted, as others have noted, a Sankey diagram is not helpful. I think you need to post your full set of variables if you want a good recommendation on alternative visualizations, though. If you simply want to show differences in attrition sources between sites and clinicians, a small-multiples series of dot plots may be the easiest for your audience to understand and for you to implement (see this example, where in your case the groups could be the sites, the elements within the groups would be the causes of attrition, and the horizontal axis would be 0-100%).
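To make the dot-plot suggestion concrete, here is a minimal sketch in base R. The site names, attrition causes, and counts are all made up for illustration (they are not your data), and I am assuming attrition is expressed as a percentage of the patients enrolled at each site. It draws one panel per site, with the causes as rows and a shared 0-100% horizontal axis.

```r
# Hypothetical attrition counts: rows are causes, columns are sites.
# All names and numbers here are invented for illustration only.
attrition <- matrix(
  c(12,  8, 5,
     7, 10, 3,
     4,  6, 9),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    cause = c("Declined", "Lost to follow-up", "Excluded"),
    site  = c("Site A", "Site B", "Site C")
  )
)

# Hypothetical number of patients enrolled at each site.
enrolled <- c("Site A" = 60, "Site B" = 50, "Site C" = 40)

# Percentage of each site's enrolled patients lost to each cause:
# divide every column by that site's enrollment, then scale to percent.
pct <- sweep(attrition, 2, enrolled[colnames(attrition)], "/") * 100

# Small multiples: one dot plot per site, all on the same 0-100% axis
# so the panels can be compared directly.
plot_attrition_dots <- function(pct) {
  old_par <- par(mfrow = c(1, ncol(pct)))
  on.exit(par(old_par))
  for (s in colnames(pct)) {
    dotchart(pct[, s], labels = rownames(pct), xlim = c(0, 100),
             main = s, xlab = "% of enrolled", pch = 19)
  }
  invisible(pct)
}

plot_attrition_dots(pct)
```

If you have more sites than fit comfortably in one row, changing `mfrow` to a grid (or putting clinicians in the panels and sites in the rows) should work the same way. The point of fixing `xlim` is that every panel stays on the same scale.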

If the Sankey diagram is something you want to use, and you are willing to dabble in another high-level language, there is a nice example (with code) in the gallery for the Python plotting package matplotlib.