Solved – How to increase longer term reproducibility of research (particularly using R and Sweave)

project-management r reproducible-research

Context:
In response to an earlier question about reproducible research, Jake wrote:

One problem we discovered when creating our JASA archive was that versions and defaults of CRAN packages changed. So, in that archive, we also include the versions of the packages that we used. The vignette based system will probably break as folks change their packages (not sure how to include extra packages within the package that is the Compendium).

Finally, I wonder about what to do when R itself changes. Are there ways to produce, say, a virtual machine that reproduces the entire computational environment used for a paper such that the virtual machine is not enormous?

Question:

  • What are good strategies for ensuring that reproducible data analysis is reproducible in the future (say, five, ten, or twenty years after publication)?
  • Specifically, what are good strategies for maximising ongoing reproducibility when using Sweave and R?

This seems to be related to the issue of ensuring that a reproducible data analysis project will run on someone else's machine with slightly different defaults, packages, etc.

Best Answer

At some level, this becomes impossible. Consider the famous Pentium floating-point bug: to guarantee identical results, you would need to preserve not only your models, data, parameters, and packages, but also all external packages, the host system or language (say, R), the OS, and potentially the hardware it all ran on. Now consider that some results may be simulation-based and may have required a particular cluster of machines...

That is simply too much to be practical.

That said, I think a more pragmatic approach may be a "good enough" compromise: keep your code (and maybe also your data) under revision control, store the versions of all relevant software, and make it possible to reproduce the results by running a single top-level script.
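As a rough illustration of the single top-level script idea, here is a minimal sketch in R. The file names (run_all.R, versions.txt, paper.Rnw), the helper functions record_versions and check_versions, and the package list are all hypothetical placeholders, not part of any existing archive; the sketch only records which versions were used and refuses to run if they differ, it does not install old versions for you.

```r
# run_all.R (hypothetical file name): single entry point for the analysis.

# Record the R version and the versions of the packages the analysis uses.
record_versions <- function(pkgs, file = "versions.txt") {
  versions <- vapply(pkgs,
                     function(p) as.character(utils::packageVersion(p)),
                     character(1))
  lines <- c(paste("R", as.character(getRversion())),
             paste(names(versions), versions))
  writeLines(lines, file)
  invisible(versions)
}

# Stop early if an installed package differs from the recorded version.
check_versions <- function(expected) {
  for (p in names(expected)) {
    installed <- as.character(utils::packageVersion(p))
    if (installed != expected[[p]]) {
      stop(sprintf("%s: expected %s, found %s", p, expected[[p]], installed))
    }
  }
  invisible(TRUE)
}

# Placeholder package list (assumption): use the packages your analysis loads.
pkgs <- c("stats", "utils")
recorded <- record_versions(pkgs, file = file.path(tempdir(), "versions.txt"))
check_versions(recorded)

# Fix the random seed so simulation-based results repeat on the same setup.
set.seed(20240101)

# Rebuild the report from source; "paper.Rnw" is a hypothetical Sweave file.
if (file.exists("paper.Rnw")) {
  utils::Sweave("paper.Rnw")    # writes paper.tex
  tools::texi2pdf("paper.tex")  # compiles paper.pdf
}
```

In a real project the recorded versions file would be committed to revision control next to the code, and a later run would read it back and pass it to check_versions before doing any analysis.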

Your mileage may vary. This also differs across disciplines and industries. But remember the old saw about the impossibility of foolproof systems: you merely create smarter fools.