Exploratory Data Analysis – Quick Glance at a Dataset for Insights

correlation, data mining, data visualization, exploratory-data-analysis

Please pardon my ignorance, but…

I keep finding myself in a situation where I'm faced with a bunch of new data I managed to find. This data usually looks something like this:

Date     Number1  Number2  Category1  Category2
20120125      11      101        Dog      Brown
20120126      21       90        Cat      Black
20120126      31      134        Cat      Brown
(...)

Usually at first glance I can't really tell if there are any trends here. The correlations between the various columns may not be very significant, but I would be delighted if I didn't have to manually create a plot for every possible combination of columns/categories.

Is there a tool out there which would accept a table of data, along with information about which columns should be treated as numbers, dates and categories, and then proceed to plot:

  • correlations between each pair of numerical columns,
  • correlations between each pair of numerical columns, with separate trend lines for each category,
  • each numerical column as a time series,
  • each numerical column as a time series, separated by category,
  • etc.
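
To get a feel for how many plots the list above implies, here is a minimal sketch in Python (standard library only) that enumerates the combinations without drawing anything. The function name `enumerate_plots` and the dictionary layout are hypothetical, made up for illustration; the column names come from the sample table.

```python
from itertools import combinations

def enumerate_plots(numeric, categorical, date=None):
    """Return one plain dict per plot described in the list (hypothetical layout)."""
    specs = []
    # Scatterplots: every pair of numeric columns, pooled and split by each category.
    for x, y in combinations(numeric, 2):
        specs.append({"kind": "scatter", "x": x, "y": y, "group_by": None})
        for cat in categorical:
            specs.append({"kind": "scatter", "x": x, "y": y, "group_by": cat})
    # Time series: every numeric column against the date, pooled and split by each category.
    if date is not None:
        for col in numeric:
            specs.append({"kind": "time_series", "x": date, "y": col, "group_by": None})
            for cat in categorical:
                specs.append({"kind": "time_series", "x": date, "y": col, "group_by": cat})
    return specs

specs = enumerate_plots(["Number1", "Number2"], ["Category1", "Category2"], date="Date")
print(len(specs))  # 3 scatterplots + 6 time series = 9
```

With n numeric and k categorical columns this gives (n(n-1)/2 + n)(1 + k) plots, so 10 numeric and 3 categorical columns already mean 220 plots.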

In the end this would generate a large number of plots, most of which would show only noise. Ideally, the tool could score the plots by correlation and in the end display a slideshow starting with the highest scoring plots. This would be a very imperfect, but useful first glance at the dataset.
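
The scoring step could be sketched roughly as follows, assuming the score is the absolute Pearson correlation of each pair of numeric columns. The helper names `pearson` and `rank_pairs` and the toy columns are made up for illustration and are not from any real dataset. Note that a score like this only rewards linear relationships.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return sxy / sqrt(sxx * syy)

def rank_pairs(columns):
    """Score every pair of columns by |r| and return them strongest first."""
    scored = [(abs(pearson(columns[a], columns[b])), a, b)
              for a, b in combinations(columns, 2)]
    return sorted(scored, key=lambda t: t[0], reverse=True)

# Made-up toy columns, for illustration only.
columns = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 1, 4, 2, 3]}
ranked = rank_pairs(columns)
for score, a, b in ranked:
    print(f"{a} vs {b}: |r| = {score:.2f}")
```

The "slideshow" would then simply walk through `ranked` in order and draw the corresponding plot for each pair.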

So? Is there a tool everyone uses for this and I just don't know about it, or is this something we need to make?

Best Answer

@Ondrej and @Michelle have provided some good information here. I wonder if I can contribute by addressing some points not mentioned elsewhere. I wouldn't beat yourself up about not being able to glean much from the data in tabular form; tables are generally not a very good way to present information (cf. Gelman et al., Turning Tables into Graphs). On the other hand, asking for a tool that will automatically generate all of the right graphs to help you explore a new data set is almost like asking for a tool that will do your thinking for you. (Don't take that the wrong way; I recognize your question makes clear that you aren't going that far. I just mean that there will never really be such a tool.) A nice discussion that is related to this can be found here.

These things having been said, I wanted to talk a little about the kinds of plots that you might want to use to explore your data. The plots listed in the question would be a good start, but we might be able to optimize that a little. To start with, making "a large number of plots" correlating pairs of variables might not be ideal. A scatterplot only displays the marginal relationship between two variables, and important relationships can often be hidden in some combination of multiple variables. So the first way to beef up this approach is to make a scatterplot matrix that displays all pairwise scatterplots simultaneously. Scatterplot matrices can be enhanced in various ways: for example, they can be combined with univariate kernel density plots of each variable's distribution, different markers / colors can be used to plot different groups, and possible nonlinear relationships can be assessed by overlaying a loess fit. The scatterplot.matrix function in the car package in R (named scatterplotMatrix in later versions of the package) can do all of these things nicely (an example can be seen halfway down the page linked above).

However, while scatterplot matrices are a good start, they still display only the marginal projections. There are a few ways to try to move beyond this. One is to explore 3-dimensional plots using the rgl package in R. Another approach is to use conditional plots; coplots can help with relationships amongst 3 or 4 variables simultaneously. An especially useful approach is to use a scatterplot matrix interactively (although this will require more effort to learn), e.g., by 'brushing'. Brushing allows you to highlight a point or points in one frame of a matrix, and those points will simultaneously be highlighted in all of the other frames. By moving the brush around, you can see how all of the variables change together. UPDATE: Another possibility that I had forgotten to mention is to use a parallel coordinates plot. This has the disadvantage of not making your response variable distinct, but it could be useful, for example, in examining inter-correlations amongst your X variables.
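
To illustrate why the marginal view can mislead, here is a small sketch in Python (standard library only) that compares the pooled correlation with the correlation inside each category, which is roughly the numerical counterpart of a conditional plot. The helper names and the numbers are made up for illustration, chosen so that the two views disagree.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_by_group(xs, ys, groups):
    """Correlation of x and y computed separately within each group label."""
    buckets = defaultdict(lambda: ([], []))
    for x, y, g in zip(xs, ys, groups):
        buckets[g][0].append(x)
        buckets[g][1].append(y)
    return {g: pearson(bx, by) for g, (bx, by) in buckets.items()}

# Made-up data: y falls with x inside each category, but rises overall.
x = [1, 2, 3, 6, 7, 8]
y = [6, 5, 4, 11, 10, 9]
category = ["Dog", "Dog", "Dog", "Cat", "Cat", "Cat"]

pooled = pearson(x, y)
by_group = correlation_by_group(x, y, category)
print(f"pooled: {pooled:.2f}")            # positive
for g, r in by_group.items():
    print(f"{g}: {r:.2f}")                # negative within each category
```

A single pooled scatterplot of these points would suggest a positive trend, while coloring by category (or conditioning on it) would show the opposite within each group.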

I also want to commend you for examining your data sorted by date collected. Although data are always gathered over time, people don't always do this. Plotting a line graph is nice, but I would suggest you supplement that with graphs of autocorrelations and partial autocorrelations. In R, the functions for these are acf and pacf respectively.
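
For readers who want to see what an autocorrelation plot actually computes, here is a minimal sketch of the usual sample autocorrelation in plain Python. The function name `sample_acf` is hypothetical, and the sketch assumes the series is already sorted by date and evenly spaced. It does not cover the partial autocorrelation.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag of an evenly spaced series."""
    n = len(series)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squares, used for normalization
    if c0 == 0:
        raise ValueError("a constant series has no defined autocorrelation")
    # Each lag: cross-products of deviations k steps apart, divided by the lag-0 sum.
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]

# Made-up alternating series: strong negative lag-1, positive lag-2 autocorrelation.
values = [1, -1, 1, -1, 1, -1]
acf_values = sample_acf(values, 2)
print([round(r, 3) for r in acf_values])  # [1.0, -0.833, 0.667]
```

Values far from zero at some lag suggest that an observation carries information about the one that many steps later, which a plain line graph may not make obvious.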

I recognize that all of this doesn't quite answer your question in the sense of giving you a tool that will make all the plots for you automatically. One implication, though, is that you wouldn't actually have to make as many plots as you fear; a scatterplot matrix, for example, is just one line of code. In addition, in R, it should be possible to write a function / some reusable code for yourself that would partly automate some of this (e.g., I can imagine a function that takes in a list of variables and a date-ordering, sorts them, and pops up a new window for each with line, acf, and pacf plots).
