Data Visualization – How to Determine When Interactive Data Visualization is Useful

data mining, data visualization, interactive-visualization

While preparing for a talk I will give soon, I recently started digging into two major free tools for interactive data visualization, GGobi and Mondrian. Both offer a great range of capabilities (even if they're a bit buggy).

I would like your help in articulating (both to myself and to my future audience) when it is helpful to use interactive plots, either for data exploration (for ourselves) or for data presentation (for a "client").

When explaining the data to a client, I can see the value of animation for:

  • Using "identify/linking/brushing" to see which data point in the graph is which.
  • Presenting a sensitivity analysis of the data (e.g., "if we remove this point, here is what we will get")
  • Showing the effect of different groups in the data (e.g., "let's look at our graphs for males and now for the females")
  • Showing the effect of time (or age, or in general, offering another dimension to the presentation)
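
The "remove this point" idea in the list above can be made concrete without any interactive tool. The following is only a minimal sketch: the data are made up for illustration, and the helper names `ols_slope` and `leave_one_out_slopes` are hypothetical, not part of GGobi or Mondrian. It refits a least-squares line once per point, each time leaving that point out, which is roughly what an interactive display lets the audience do by clicking.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def leave_one_out_slopes(xs, ys):
    """Refit the line once per point, each time with that point removed."""
    return [
        ols_slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]


# Hypothetical data: five points on the line y = x plus one influential point.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 0.0]

full_slope = ols_slope(xs, ys)
loo = leave_one_out_slopes(xs, ys)

# The point whose removal changes the slope the most.
most_influential = max(range(len(xs)), key=lambda i: abs(loo[i] - full_slope))

print(f"slope with all points: {full_slope:.3f}")
for i, slope in enumerate(loo):
    print(f"without point {i}: slope = {slope:.3f}")
print(f"most influential point: index {most_influential}")
```

In an interactive plot the same comparison happens visually: the fitted line jumps when the influential point is deselected, which is often more convincing to a client than a table of slopes.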

When exploring the data ourselves, I can see the value of identify/linking/brushing for investigating an outlier in a dataset we are working on.
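
For readers unfamiliar with the term, linked brushing amounts to sharing one set of selected row indices across several plots. Here is a rough sketch of that idea only, with invented records and hypothetical function names (`brush`, `linked_view`); it does not reflect how GGobi or Mondrian implement it internally.

```python
# Hypothetical records: each row is one observation with three variables.
rows = [
    {"id": "a", "height": 160, "weight": 55, "age": 31},
    {"id": "b", "height": 172, "weight": 68, "age": 45},
    {"id": "c", "height": 181, "weight": 80, "age": 27},
    {"id": "d", "height": 165, "weight": 120, "age": 62},  # unusual weight
]


def brush(rows, x_var, y_var, x_range, y_range):
    """Indices of rows inside a rectangular brush drawn on the
    (x_var, y_var) scatterplot."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return {
        i
        for i, r in enumerate(rows)
        if x_lo <= r[x_var] <= x_hi and y_lo <= r[y_var] <= y_hi
    }


def linked_view(rows, x_var, y_var, selected):
    """Coordinates for a second plot, each flagged with the shared selection."""
    return [(r[x_var], r[y_var], i in selected) for i, r in enumerate(rows)]


# Brush the high-weight region of the height/weight plot ...
selected = brush(rows, "height", "weight", (150, 190), (100, 130))

# ... and see where the same observation lands in the age/height plot.
for x, y, highlighted in linked_view(rows, "age", "height", selected):
    marker = "*" if highlighted else " "
    print(f"{marker} age={x} height={y}")
```

The key design point is that the selection lives on the observations, not on any one plot, so every view can highlight the same outlier at once.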

But other than these two examples, I am not sure what other practical use these techniques offer, especially for our own data exploration!

It could be argued that the interactive part is good for exploring, for example, how different groups/clusters in the data behave differently. But when I have faced such a situation in practice, what I tended to do was run the relevant statistical procedures (and post-hoc tests), and then plot whatever I found to be significant, with colors clearly dividing the data into the relevant groups. From what I've seen, this is a safer approach than "wandering around" the data, which could easily lead to data dredging (where the scope of the multiple comparisons needing correction is not even clear).
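
To illustrate why the scope of the comparisons matters, here is a small sketch using the Holm step-down correction (one common choice; the question does not name a specific method). The p-values are invented, and `holm_reject` is a hypothetical helper written for this example. The same small p-value survives correction when three comparisons were planned, but not when seven more were looked at informally while browsing plots.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: list of booleans, True = reject the null."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = 0, 1, ...) is compared to alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values from three comparisons planned in advance.
p_planned = [0.012, 0.020, 0.200]

# The same three, plus seven comparisons eyeballed while exploring.
p_browsed = p_planned + [0.15, 0.35, 0.41, 0.52, 0.60, 0.77, 0.90]

print("planned only:", holm_reject(p_planned))
print("with browsing:", holm_reject(p_browsed)[:3])
```

With free-form interactive exploration the number of implicit comparisons is unknown, so the divisor in the correction is unknown too, which is exactly the concern raised above.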

I'd be very happy to read your experience/thoughts on this matter.

(This question can be a wiki, although it is not subjective, and a well thought-out answer will gladly win my "answer" mark 🙂)

Best Answer

In addition to linking quantitative or qualitative data to spatial patterns, as illustrated by @whuber, I would like to mention the use of EDA, with brushing and the various ways of linking plots together, for longitudinal and high-dimensional data analysis.

Both are discussed in the excellent book Interactive and Dynamic Graphics for Data Analysis With R and GGobi, by Dianne Cook and Deborah F. Swayne (Springer Use R!, 2007), which you surely know. The authors have a nice discussion of EDA in Chapter 1, justifying the need for EDA to "force the unexpected upon us", quoting John Tukey (p. 13). The use of interactive and dynamic displays is neither data snooping nor preliminary data inspection (e.g., purely graphical summaries of the data); rather, it is seen as an interactive investigation of the data which might precede or complement pure hypothesis-based statistical modeling.

Using GGobi together with its R interface (rggobi) also solves the problem of how to generate static graphics for an intermediate report or a final publication, even with Projection Pursuit (pp. 26-34), thanks to the DescribeDisplay or ggplot2 packages.

In the same vein, Michael Friendly has long advocated the use of data visualization in categorical data analysis, which has been largely exemplified in the vcd package, but also in the more recent vcdExtra package (including dynamic visualization through the rgl package), which acts as a glue between the vcd and gnm packages for extending log-linear models. He recently gave a nice summary of that work during the 6th CARME conference, Advances in Visualizing Categorical Data Using the vcd, gnm and vcdExtra Packages in R.

Hence, EDA can also be thought of as providing a visual explanation of data (in the sense that it may account for unexpected patterns in the observed data), prior to a purely statistical modeling approach, or in parallel to it. That is, EDA not only provides useful ways of studying the internal structure of the data at hand, but it may also help to refine and/or summarize statistical models applied to it. This is, in essence, what biplots allow us to do, for example. Although they are not multidimensional analysis techniques per se, they are tools for visualizing results from multidimensional analysis (by giving an approximation of the relationships when considering all individuals together, or all variables together, or both). Factor scores can be used in subsequent modeling in place of the original metric, either to reduce the dimensionality or to provide intermediate levels of representation.

Sidenote

At the risk of being old-fashioned, I'm still using xlispstat (Luke Tierney) from time to time. It has simple yet effective functionality for interactive displays that is currently not available in base R graphics. I'm not aware of similar capabilities in Clojure+Incanter (+Processing).
