Random Generation – Set the Seed Before Each Code Block or Once Per Project?


It is standard advice to set a random seed so that results can be reproduced. However, since the generator's state advances each time a pseudo-random number is drawn, the results could change if any piece of code draws an additional number.
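To make the concern concrete, here is a minimal sketch in R. The seed value and the variable names are arbitrary choices for illustration; the `extra` line stands in for any code added to a script later.

```r
# Baseline: seed once, then draw in sequence.
set.seed(42)
a1 <- runif(1)
b1 <- runif(3)

# Same seed, but one additional draw slipped in between.
set.seed(42)
a2 <- runif(1)
extra <- runif(1)   # hypothetical line added later
b2 <- runif(3)

identical(a1, a2)   # TRUE: draws before the insertion are unaffected
identical(b1, b2)   # FALSE: every draw after the insertion is shifted
```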

At first glance, version control looks to be a solution to this, as it would at least allow you to go back and reproduce the version extant when you wrote down the results in your notes or paper. However, since it only takes one draw to mess things up, if you update R the results could change as well.

I realize that this is probably only problematic in rare cases, but I'm curious if there are any best practices here. This is something I've been struggling with in my own work.

Best Answer

It depends on how you will run the code and on whether any of the code is itself stochastic, in that it draws random numbers in a random way. (An example of this is the permutation tests in our vegan package, where we only continue permuting until we have amassed enough data to know whether a result is different from the stated Type I error, taking into account a Type II error rate.) Although even that shouldn't affect the draws...

If the final script will only ever be run as a batch job or in its entirety and there are no stochastic draws from the pseudo-random number generator then it is safe to set a seed at the top of the script and run it in its entirety.

If you want to step through the code, perhaps rerunning blocks, then you need a set.seed() call before each function call that will draw from the pseudo-random number generator.

For my scientific papers I routinely go super defensive and set seeds prior to each code chunk; this allows for updates to the script at a later date that might need to be inserted into the existing script at any point - say to respond to reviewers' or co-authors' comments.
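A rough sketch of the defensive, seed-per-chunk approach might look like the following. The function `run_chunks()`, its `extra_draws` argument, and the seed values are all made up for illustration; the `runif()` call stands in for code inserted later.

```r
run_chunks <- function(extra_draws = 0) {
  # Chunk 1: simulate some data
  set.seed(101)
  x <- rnorm(20)

  # Stand-in for code inserted at a later date (e.g. for a reviewer)
  if (extra_draws > 0) runif(extra_draws)

  # Chunk 2: reseeded, so it does not depend on what ran before it
  set.seed(102)
  idx <- sample(20, 5)

  list(x = x, idx = idx)
}

# The inserted draws leave both chunks' results unchanged.
identical(run_chunks(0), run_chunks(3))   # TRUE
```

Without the second set.seed() call, `idx` would depend on how many numbers were drawn between the two chunks.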

Your results will hopefully not be contingent on a particular set of pseudo-random values, so the issue is being able to reproduce the exact values stated in a report or paper. Even if you are super defensive and set a seed on each code chunk, you may still need to recreate the exact installation (R version and package versions), so recording those details is essential. To be extra safe, you'll need to keep previous R versions and packages around for specific projects or papers. Indeed, many people do this.
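One simple way to record those details is to write them to a file next to the results. This is only a sketch: the file name and location are hypothetical, and you may prefer a dedicated tool for pinning package versions.

```r
# Collect the R version, the RNG settings and the loaded package versions.
info <- utils::sessionInfo()
rng  <- RNGkind()   # generator kind, normal kind and (in R >= 3.6.0) sample kind

# Hypothetical output location; a project directory would be more usual.
out_file <- file.path(tempdir(), "session-info.txt")

writeLines(
  c(R.version.string,
    paste("RNG settings:", paste(rng, collapse = ", ")),
    utils::capture.output(print(info))),
  con = out_file
)
```

Keeping the RNG settings with the session information matters because the seed alone does not determine the numbers; the generator in use does too.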

Related Solutions

Solved – Where in R code should I use set.seed() function (specifically, before shuffling or after)

You use set.seed to reproduce your results. Therefore you have to use this function before you generate the random variables. This means:

> set.seed(1)
> sample(c(1,2,3,4,5,6,7,8,9,10),4)
[1] 3 4 5 7
> sample(c(1,2,3,4,5,6,7,8,9,10),4)
[1] 3 9 8 5

If you do the same again, you get the same numbers.

> set.seed(1)
> sample(c(1,2,3,4,5,6,7,8,9,10),4)
[1] 3 4 5 7

If you execute your code again, setting the seed before sampling reproduces the same output, whereas sampling without resetting the seed first gives a different result.

EDIT: To make it clear: set.seed() initializes your random number generator.
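Note that the same seed only gives the same numbers under the same generator settings. From R 3.6.0 the default algorithm used by sample() changed, so the calls above will generally print different numbers on a recent R than on an older one. A sketch of how to compare the two behaviours, assuming R 3.6.0 or later (the variable names are illustrative):

```r
# Assumes R >= 3.6.0, where RNGkind() reports three settings.
old <- RNGkind()                        # remember the current settings

# Switch to the pre-3.6.0 defaults; R warns about the older sampler.
suppressWarnings(RNGversion("3.5.0"))
set.seed(1)
legacy <- sample(1:10, 4)

# Restore the session's settings and repeat with the same seed.
RNGkind(old[1], old[2], old[3])
set.seed(1)
current <- sample(1:10, 4)

legacy
current
```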

Monte Carlo – Should a New or Same Seed Be Used for Each Simulation Run?

For pseudo-random numbers, the seed is not there to "ensure randomness". In fact, it is quite the opposite: the seed is there to ensure reproducibility. You set the seed if you want to be able to run the same pseudo-random Monte Carlo experiments again and get the exact same results, for example if your scripts will be archived with an eventual publication.

It does not make sense to set the seed 10,000 times. You could set it once at the beginning of each of your 1,000 runs, but if they are quick and all run in a single loop, then setting the seed once at the beginning should be fine.
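The two options can be sketched in R as follows. The simulated statistic (the mean of 50 normal draws), the seed values, and the scheme of deriving per-run seeds from a base seed are all assumptions made for illustration.

```r
n_runs <- 1000

# Option 1: seed once, then let the generator advance through all runs.
set.seed(2024)
means_once <- vapply(seq_len(n_runs),
                     function(i) mean(rnorm(50)),
                     numeric(1))

# Option 2: a distinct, recorded seed at the beginning of each run,
# so that any single run can be repeated on its own.
base_seed <- 2024
run_one <- function(i) {
  set.seed(base_seed + i)
  mean(rnorm(50))
}
means_per_run <- vapply(seq_len(n_runs), run_one, numeric(1))

# Under option 2, run 17 can be reproduced in isolation.
identical(run_one(17), means_per_run[17])   # TRUE
```

Option 1 reproduces the whole experiment only when it is rerun from the start; option 2 costs one extra set.seed() call per run in exchange for being able to revisit individual runs.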
