Solved – How to keep exploratory analyses of large datasets in check

exploratory-data-analysis, project-management

When I start an exploratory analysis on a large data set (many samples, many variables), I often find myself with hundreds of derived variables, and tonnes of different plots, and no real way to keep track of what's going where. Code ends up like spaghetti, because there's no direction from the start…

Are there any recommended methods for keeping an exploratory analysis neat and tidy? In particular, how do you deal with multiple branches of exploration (including the ones that were dead-ends), and with different versions of plots?

For reference, I'm working on geoscientific data (many variables over time, sometimes also over space). I usually work with Python or R, and store everything in git, and have been trying out the IPython Notebook as well. However, it would be good if answers were somewhat general and useful for people in all fields, with other types of (large?) data.

Best Answer

I think that frequently, the tendency to feel like you've gone down a rabbit hole with exploratory analyses is due to losing sight of the substantive question(s) you're asking. I do it myself, occasionally, and then have to remind myself what my goal(s) are. For example, am I trying to build a specific model, or evaluate the adequacy of an existing one? Am I looking for evidence of problems with the data (i.e., forensic data analysis)? Or, is this in the early stages of analysis, where I am investigating specific questions informally (e.g., is there a relationship between two variables?) before moving on to develop a formal model? In sum, if you catch yourself cranking out plots and tables but can't state clearly what your immediate goal is or why that plot/table is relevant, then you know you're getting pulled along by the activity (instead of being in control of it).

I try to approach exploratory data analysis like I do writing, whether that be writing a program or writing an article. In either case, I wouldn't start without making an outline first. That outline can change (and frequently does), of course, but to start writing without one is inefficient, and often yields a poor final product.

WRT organization, each analyst has to find a workflow that works for him or her; doing so is IMO more important than trying to rigidly follow someone else's workflow (though it is always helpful to get ideas from what others are doing). If you're working programmatically (i.e., writing code that can be run to generate/regenerate a set of results) and checking your work into git, then you're already miles ahead of many in this regard. I suspect that you may just need to spend some time organizing your code, and for that, I would suggest following your outline. For example, keep your analysis files relatively short and targeted, so that each answers one specific question (e.g., diagnostic plots for a specific regression model). Organize these into subdirectories at one or two levels, depending on the size and complexity of the project. In this way, the project becomes self-documenting; a list view of the directories, subdirectories and files (together with the comment at the top of each file) should, in theory, reproduce your outline.
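To make the "self-documenting" idea concrete, here is a minimal Python sketch that walks a project folder and lists each analysis script next to the first comment line at its top. The names `first_comment` and `build_outline`, the `analysis` folder, and the convention that every script opens with a one-line `#` comment are all assumptions for illustration, not part of any particular tool.

```python
from pathlib import Path


def first_comment(path):
    """Return the first '#' comment line of a script, or '' if there is none."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped.startswith("#!"):
                continue  # skip a shebang line
            if stripped.startswith("#"):
                return stripped.lstrip("#").strip()
            if stripped:
                break  # first non-blank line is code, so no header comment
    return ""


def build_outline(root, suffixes=(".py", ".R")):
    """Return outline lines for the scripts under root, indented by depth."""
    root = Path(root)
    lines = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if any(part.startswith(".") for part in rel.parts):
            continue  # ignore hidden folders such as .git
        indent = "  " * (len(rel.parts) - 1)
        if path.is_dir():
            lines.append(f"{indent}{path.name}/")
        elif path.suffix in suffixes:
            lines.append(f"{indent}{path.name}: {first_comment(path)}")
    return lines


if __name__ == "__main__":
    # 'analysis' is a hypothetical folder name; point this at your own project.
    print("\n".join(build_outline("analysis")))
```

If the printed listing no longer resembles the outline you started from, that is probably a hint that either the files or the outline need reorganizing.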

Of course, in a large project, you might also have code that does data cleaning and management, code you've written to estimate a certain type of model, or other utilities you've written, and these won't fit within the substantive outline for your data analysis, so they should be organized in a different part of your project folder.

Update: After posting this, I realized that I didn't directly address your question about "dead ends." If you really decide that an entire set of analyses is of no value, then if you're working in git, you can always delete the corresponding file(s) with a commit message like "Abandoned this line of analysis because it wasn't productive." Unlike crumpling up what you've written and throwing it in the trash, you can always go back to what you did later on, if desired.

However, I think you'll find that if you proceed from an outline to which you've given some thought, you'll have fewer so-called dead-ends. Instead, if you spend time investigating a worthwhile and relevant question—even if this leads to a null finding or doesn't turn out like you anticipated—you probably still want to keep a record of what you've done and the outcome (at a minimum, so that you don't make the mistake of repeating this later on). Just move these to the bottom of your outline, in a sort of "Appendix."
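One lightweight way to keep such a record is a plain-text log that lives in the repository next to the code. The sketch below is only an illustration: the function `record_finding`, the entry layout, and the file name `findings.txt` are hypothetical choices, and a hand-edited notes file would serve just as well.

```python
from datetime import date


def record_finding(log_path, question, outcome, script, today=None):
    """Append one dated entry describing an analysis and how it turned out."""
    stamp = (today or date.today()).isoformat()
    entry = (
        f"{stamp} | {script}\n"
        f"  Question: {question}\n"
        f"  Outcome:  {outcome}\n\n"
    )
    # Append mode keeps earlier entries, so the log grows in chronological order.
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry
```

A call such as `record_finding("findings.txt", "Is there a relationship between two variables?", "Null finding", "some_script.py")` (all argument values here are made up) adds one entry, and because the log is committed with the code, the git history shows when each question was asked and answered.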
