Solved – Best way to visualize attrition using R

Tags: data visualization, r, sankey-diagram

Through this site I recently discovered Sankey diagrams, a great way to visualize what is happening in a traditional flow chart.

Here is a good example of a Sankey diagram by George M. Whitesides and George W. Crabtree:

Source: Don't Forget Long-Term Fundamental Research in Energy, Science, 9 February 2007, Vol. 315, no. 5813, pp. 796–798.

After realizing that there was no Sankey R package, I found an R script online. Unfortunately, this script is quite raw and somewhat limited. With high hopes I asked on Stack Overflow for a Sankey R package or a more mature function, but to my surprise it seems that we do not have a mature function for building Sankey diagrams in R.

After I posted a bounty, Geek On Acid was kind enough to suggest a small hack to the existing script, which made it work more or less for my specific purpose.

The improved R script produced this diagram:
[Figure: Geek On Acid's R Sankey diagram]
Source: stackoverflow.com.

But does the lack of an R package indicate that Sankey diagrams aren't such an amazing way to visualize attrition in a data flow like the one presented in the diagram above? (See the initial Stack Overflow question for data and R code.) Maybe there's a better way to visualize attrition.

What do you think is the best way to visualize attrition in a data flow using R?

Best Answer

I agree with @gung. The Sankey diagram you posted is, I think, a pretty good example of where the technique can help. While it is complicated, the context (energy input and output) is complex too and it is hard to think of a nicer way of visualizing the paths of inputs-to-outputs-acting-as-new-inputs across multiple categories of usage.

Now then, for the attrition example you posted, as others have noted it is not helpful to use a Sankey diagram. I think you need to post your full set of variables if you want a good recommendation on alternative visualizations though. If you simply want to show differences in attrition sources between sites and clinicians, a small-multiples series of dot plots may be the easiest for your audience to understand and for you to implement (see this example, where in your case the groups could be the sites, the elements within the groups would be the causes of attrition, and the horizontal axis would be 0-100%).
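As a rough illustration of the dot plot idea, here is a minimal sketch using base R's dotchart. The site names, attrition causes, and percentages are all made up for the example, since the real variables were not posted. Passing a matrix to dotchart groups the points by column (here, site) and labels them by row (here, cause).

```r
# Hypothetical attrition data: percent of enrolled patients lost, by cause and site.
# All names and numbers below are invented for illustration only.
attrition <- matrix(
  c(12, 5, 8,     # Site A
    20, 3, 10,    # Site B
     7, 9, 4),    # Site C
  nrow = 3,
  dimnames = list(
    cause = c("Declined", "Lost to follow-up", "Excluded"),
    site  = c("Site A", "Site B", "Site C")
  )
)

# One group of dots per site, one dot per cause, on a common 0-100% axis
dotchart(attrition, xlim = c(0, 100), pch = 16,
         xlab = "Attrition (% of enrolled patients)")
```

If separate panels per site are preferred over a single grouped chart, the same matrix could be drawn one column at a time with par(mfrow = ...), keeping the axis limits fixed so the panels stay comparable.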

If the Sankey diagram is something you want to use, and you are willing to dabble in another high-level language, there is a nice example (with code) in the gallery for the Python plotting package, matplotlib.

Related Solutions

Solved – the best way to visualize relationship between discrete and continuous variables

Below: The original plot may be misleading because the discrete nature of the variables makes the points overlap:

[Figure: original scatterplot with overlapping points]

One way to work around it is to introduce some transparency to the data symbol:

[Figure: scatterplot with semi-transparent symbols]

Another way is to displace each symbol slightly at random so that overlapping points spread into a smear. This technique is called "jittering":

[Figure: scatterplot with jittered symbols]

Both solutions will still allow you to fit a straight line to assess linearity.

R code for your reference:

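# Simulated discrete data: x takes integer values 0 to 9
x <- trunc(runif(200)*10)
y <- x * 2 + trunc(runif(200)*10)
plot(x,y,pch=16)                                # original plot: points overlap
plot(x,y,col="#00000020",pch=16)                # semi-transparent symbols
plot(jitter(x),jitter(y),col="#000000",pch=16)  # jittered symbols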
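To illustrate the point about fitting a straight line, here is a small sketch that overlays a least-squares line on a jittered plot. The seed, the transparency level, and the decision to fit on the un-jittered data are my own choices for the example, not part of the original answer.

```r
set.seed(1)                         # arbitrary seed, only for reproducibility
x <- trunc(runif(200) * 10)
y <- x * 2 + trunc(runif(200) * 10)

# Fit on the original values so the jitter noise does not enter the estimate
fit <- lm(y ~ x)

# Jitter only for display, then draw the fitted line on top
plot(jitter(x), jitter(y), col = "#00000060", pch = 16)
abline(fit, col = 2, lwd = 2)
```

Because the simulated relationship is y = 2x plus noise, the fitted slope should come out close to 2.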

Visualize-One-Sample-Test – How to Best Visualize One-Sample Test Results

Something like this?

[Figure: one-sample boxplot]

Or were you after some interval for the median, like you get with notched boxplots (but suited to a one sample comparison, naturally)?

Here's an example of that:

[Figure: one-sample boxplot with an interval for the median]

This uses the interval suggested in McGill et al. (the one in the references of ?boxplot.stats). One could actually use notches, but that might increase the chance that the plot is mistaken for an ordinary notched boxplot.

Of course if you need something to more directly replicate the signed rank test, various things can be constructed that do that, which could even include the interval for the pseudo-median (i.e. the one-sample Hodges-Lehmann location estimate, the median of pairwise averages).
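For readers unfamiliar with the pseudo-median, the "median of pairwise averages" can be computed directly. The sketch below uses a small made-up sample (not the data from the question) and forms all averages (x_i + x_j)/2 with i <= j, often called Walsh averages.

```r
# Hypothetical sample, invented for illustration
x <- c(0.9, 1.4, 1.1, 1.6, 0.8, 1.3, 1.2)

# All pairwise sums, halved; keep each pair once, including i == j
walsh <- outer(x, x, "+") / 2
walsh <- walsh[upper.tri(walsh, diag = TRUE)]

# One-sample Hodges-Lehmann estimate: the median of the Walsh averages
pseudo_median <- median(walsh)
pseudo_median
```

For a sample like this, wilcox.test(x, conf.int = TRUE)$estimate should report a comparable value, which is the route taken in the rest of this answer.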

Indeed, wilcox.test can generate the necessary information for us, so this is straightforward:

> wilcox.test(pd,mu=1.1,conf.int=TRUE)

    Wilcoxon signed rank test

data:  pd
V = 72, p-value = 0.5245
alternative hypothesis: true location is not equal to 1.1
95 percent confidence interval:
 0.94 1.42
sample estimates:
(pseudo)median 
        1.1775

and this can be plotted also:

[Figure: boxplot with signed rank interval for the pseudomedian]

[The reason the boxplot interval is wider is that the standard error of a median at the normal (which is the assumption underlying the calculation based off the IQR) tends to be larger than that for a pseudomedian when the data are reasonably normalish.]

And of course, one might want to add the actual data to the plot:

[Figure: same plot with a jittered strip chart under the interval]

Z-value

R uses the sum of the positive ranks as its test statistic (this is not the same statistic as discussed on the Wikipedia page on the test).

Hollander and Wolfe give the mean of the statistic as $n(n+1)/4$ and the variance as $n(n+1)(2n+1)/24$.

So for your data, this gives a mean of 60, a standard deviation of 17.61, and, with the observed V = 72, a z-value of (72 − 60)/17.61 = 0.682 (ignoring the continuity correction).
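The arithmetic can be checked in a few lines. The value n = 15 is not stated in the answer; it is inferred here from the stated mean, since n(n+1)/4 = 60 only for n = 15, and it assumes no observations were dropped as zero differences.

```r
V <- 72   # test statistic from the wilcox.test output above
n <- 15   # assumption: implied by the stated mean n(n+1)/4 = 60

mu_V <- n * (n + 1) / 4                        # mean of V under the null
sd_V <- sqrt(n * (n + 1) * (2 * n + 1) / 24)   # standard deviation of V under the null
z    <- (V - mu_V) / sd_V                      # no continuity correction

round(c(mean = mu_V, sd = sd_V, z = z), 3)
# mean 60, sd 17.607, z 0.682
```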

The code I used to generate the fourth plot (from which the earlier ones can also be done by omitting unneeded parts) is a bit rough (it's mostly specific to the question, rather than being a general plotting function), but I figured someone might want it:

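# pd is the sample from the question (not reproduced here)

# Half-width of an approximate 95% interval for the median of one sample,
# based on the IQR (the normal-theory calculation behind boxplot notches)
notch1len <- function(x) {
  stats <- stats::fivenum(x, na.rm = TRUE)
  iqr <- diff(stats[c(2, 4)])
  # 1.253 ~ SE(median)/SE(mean) at the normal; IQR/1.35 estimates the sd
  (1.96*1.253/1.35)*(iqr/sqrt(sum(!is.na(x))))
}

w <- notch1len(pd)
m <- median(pd)

boxplot(pd,horizontal=TRUE,boxwex=.4)

abline(v=1.1,col=8)                             # hypothesized location
points(c(m-w,m+w),c(1,1),col=2,lwd=6,pch="|")   # interval for the median

# Signed rank interval and pseudomedian from a single wilcox.test call
wt <- wilcox.test(pd,mu=1.1,conf.int=TRUE)
ci <- wt$conf.int
est <- wt$estimate

stripchart(pd,pch=16,add=TRUE,at=0.7,cex=.7,method="jitter",col=8)  # raw data

points(c(ci,est),c(0.7,0.7,0.7),pch="|",col=4,cex=c(.9,.9,1.5))
lines(ci,c(0.7,0.7),col=4)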

I may come back and post more functional code later.
