Random Generation – How to Set Seed Before Each Code Block or Once Per Project?

random-generation

It is standard advice to set a random seed so that results can be reproduced. However, since the generator's state advances every time a pseudo-random number is drawn, the results could change if any piece of code draws an additional number.

At first glance, version control looks to be a solution to this, as it would at least allow you to go back and reproduce the version extant when you wrote down the results in your notes or paper. However, since it only takes one extra draw to change everything that follows, the results could change if you update R as well.

I realize that this is probably only problematic in rare cases, but I'm curious if there are any best practices here. This is something I've been struggling with in my own work.

Best Answer

It depends on how you will run the code and on whether any of the code is stochastic in the sense that how many random numbers it draws is not fixed in advance. (An example of this is the permutation tests in our vegan package, where we only continue permuting until we have amassed enough data to know whether a result is different from the stated Type I error, taking into account a Type II error rate.) Although even that shouldn't affect the draws...

If the final script will only ever be run as a batch job or in its entirety, and there are no stochastic draws from the pseudo-random number generator, then it is safe to set a single seed at the top of the script.

If you want to step through the code, perhaps rerunning blocks, then you need a set.seed() call before each function call that draws from the pseudo-random number generator.
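As a minimal sketch of that idea (the seed values and object names here are arbitrary and purely illustrative), each block below carries its own set.seed() call, so the second block can be rerun on its own and still give the same permutation:

```r
# Block A: seeded on its own
set.seed(101)
x <- rnorm(5)

# Block B: also seeded on its own, so it does not depend on block A having run
set.seed(102)
perm <- sample(10)

# Rerunning block B in isolation, as one might while stepping through a script
set.seed(102)
perm_again <- sample(10)

# TRUE when the rerun reproduces the original permutation
identical(perm, perm_again)
```

Without the second set.seed() call, the rerun would continue from wherever the generator happened to be and would generally give a different permutation.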

For my scientific papers I routinely go super defensive and set a seed prior to each code chunk; this allows new code to be inserted at any point in the existing script at a later date, say to respond to reviewers' or co-authors' comments, without changing the results of the chunks that follow.
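To illustrate why per-chunk seeds help with later insertions, here is a small sketch. The functions top_seed() and chunk_seed() are hypothetical stand-ins for a script, and the `extra` argument mimics a draw inserted afterwards:

```r
# One seed at the top of the "script"; `extra` mimics a draw inserted later
top_seed <- function(extra = FALSE) {
  set.seed(1)
  if (extra) runif(1)   # hypothetical code added for a reviewer
  rnorm(3)              # the chunk whose values were reported
}

# Same "script", but the reported chunk sets its own seed first
chunk_seed <- function(extra = FALSE) {
  set.seed(1)
  if (extra) runif(1)
  set.seed(2)           # seed set immediately before the chunk
  rnorm(3)
}

# With a single top seed, the inserted draw should shift the later values
identical(top_seed(), top_seed(extra = TRUE))

# With a per-chunk seed, the reported values should be unaffected
identical(chunk_seed(), chunk_seed(extra = TRUE))
```

The first comparison should come out FALSE and the second TRUE, since resetting the seed discards whatever the inserted code did to the generator state.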

Your results will hopefully not be contingent on a particular set of pseudo-random values, so the issue is being able to reproduce the exact values stated in a report or paper. Even though you might be super defensive and set a seed on each code chunk, you still may need to recreate the exact installation (R version and package versions), so recording those details is essential. To be extra safe you'll need to keep previous R versions and packages around for specific projects/papers. Indeed, many people do this.
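One simple way to record those details, sketched here with base R only, is to save the output of sessionInfo() together with the generator settings reported by RNGkind(). The file name session-info.txt is just an assumption for illustration:

```r
# Generator settings in use (uniform generator, normal method, sample method)
rng <- RNGkind()

# R version, platform, and attached/loaded package versions
info <- sessionInfo()

# Hypothetical output file; any location kept alongside the project would do
out_file <- "session-info.txt"

writeLines(
  c(paste("RNG settings:", paste(rng, collapse = ", ")),
    capture.output(print(info))),
  out_file
)
```

Recording the RNG settings matters because defaults can change between releases: R 3.6.0, for example, changed the default method used by sample(), and RNGversion("3.5.0") can be called to request the older behaviour when rerunning code written before that change.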
