Solved – Choosing the “Correct” Seed for Reproducible Research/Results

error, machine-learning, modeling, predictive-models, reproducible-research

I have created this post to spark a discussion around an idea I had regarding the choice of the seed number used to replicate the results of a statistical model. Here is some background on how I came up with this idea and how it may be applied.

I have a data set, called data, which has 506 observations, 13 predictor variables, and 1 continuous response variable. I split the data into training and test sets by randomly sampling 380 row indexes without replacement: the 380 sampled rows of the data data frame went into a data frame called train, and the remaining 126 rows went into a data frame called test.

Once I had the training and test sets, I built a linear model with the glm function, predicted the outcomes for the test set with the predict function, and computed the MSE between the predicted and actual values on the test set. In this instance I obtained an MSE of 16.65786. I left the code as it was for the night since it was time to go home.

The R code for this is:

library(MASS)
data <- Boston

index <- sample(x = 1:nrow(data), size = round(nrow(data)*0.75), replace = FALSE)
train <- data[index, ]
test <- data[-index, ]

lm1.fit <- glm(medv ~ ., data = train)

pr.lm1 <- predict(lm1.fit, test)
(MSE.lm1 <- sum((pr.lm1 - test$medv)^2) / nrow(test))

The following day, I realized that I should have set a seed so that the random sample of indexes (rows) used to create the training and test sets would be the same each time the code is run. So I chose a seed of 1 and reran the code above. This time I got an MSE of 28.72933, a huge difference from the MSE I had obtained the night before. This sparked the question: what if the seed I chose led to an outlier result, or what if the unseeded result from the night before was itself an outlier? Maybe both results were outliers! So I did some investigating.

I ran a for loop 10,000 times; on each iteration I drew a new random sample of indices, created new training and test sets, built a linear model, predicted the outcomes for the test set, and calculated the MSE. I stored each of the 10,000 MSEs in a data frame and computed their mean and median. I also looked at the distribution of the MSEs, which appeared to be approximately normal. I then reran the for loop to make sure the mean and median of the MSEs did not change much; they didn't.

The first 10,000 MSEs had a mean of 24.13 and a median of 23.51.
The second 10,000 MSEs had a mean of 24.01 and a median of 23.48.

The R code for this is:

error <- data.frame(err = rep(0, 10000))
for(i in 1:10000)
{
 index <- sample(x = 1:nrow(data), size = round(nrow(data)*0.75), 
                 replace = FALSE) 
 train <- data[index, ]
 test <- data[-index, ]

 #Linear Model 1 - No Data Cleaning
 lm1.fit <- glm(medv ~ ., data = train)
 summary(lm1.fit)

 pr.lm1 <- predict(lm1.fit, test)
 error[i, 1] <- sum((pr.lm1 - test$medv)^2) / nrow(test)
}
summary(error[, 1])
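
For reference, a quick histogram of the stored MSEs is one way to see the roughly normal shape described above (just a sketch; the bin count and reference lines are arbitrary choices):

hist(error$err, breaks = 50, main = "Test-set MSE over 10,000 random splits",
     xlab = "MSE")
abline(v = mean(error$err), col = "red")    # mean of the MSEs
abline(v = median(error$err), col = "blue") # median of the MSEs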

The last thing I did was loop through this same model-building process until I found a seed that led to an MSE in between the mean and the median of the 20,000 MSEs.

The R code for this is:

for(i in 1:10000)
{
 set.seed(i)
 index <- sample(x = 1:nrow(data), size = round(nrow(data)*0.75), 
                 replace = FALSE) 
 train <- data[index, ]
 test <- data[-index, ]

 #Linear Model 1 - No Data Cleaning
 lm1.fit <- glm(medv ~ ., data = train)
 summary(lm1.fit)

 pr.lm1 <- predict(lm1.fit, test)
 error <- sum((pr.lm1 - test$medv)^2) / nrow(test)
 if(error > 23 & error < 24){print(i)}
}

I chose to use a seed of 47, which led to an MSE of 23.63085. I believe this method of obtaining a seed more faithfully reflects the model's predictive ability than randomly picking a seed, or hunting for the seed that gives the best results (which I view as "cheating").

My question to all you statisticians out there: does this seem like a reasonable approach? Is there a more appropriate method out there? Is there a downside to picking a seed this way?

Thank you everyone in advance for your input!


[Scortchi]
Thanks – I was trying to be pithy & conflated two points: (1) Specifically the OP's re-inventing Monte-Carlo cross-validation; & (2) the sensitivity of the out-of-sample MSE estimate to the seed used shows that they should be using some form of re-sampling validation rather than a train/test split. – Scortchi
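
A minimal sketch of the Monte-Carlo cross-validation idea mentioned here (mcCV, B, and train_frac are illustrative placeholders, not code from the thread): report the MSE averaged over many random splits rather than the MSE from any single seeded split.

library(MASS)
# Monte-Carlo cross-validation: average the test MSE over B random splits
mcCV <- function(data, B = 100, train_frac = 0.75) {
  mse <- numeric(B)
  for (b in 1:B) {
    idx <- sample(nrow(data), size = round(nrow(data) * train_frac))
    fit <- glm(medv ~ ., data = data[idx, ])
    pred <- predict(fit, newdata = data[-idx, ])
    mse[b] <- mean((pred - data[-idx, "medv"])^2)
  }
  c(mean = mean(mse), sd = sd(mse))  # report the average, not one split's MSE
}
mcCV(Boston)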

[Brandon]
@Scortchi So what you're saying is I should simply use the following code instead:

library(MASS)  
data <- Boston  

lm1.fit <- glm(medv ~ ., data = data)  

library(boot)  
set.seed(1)  
cvResult <- cv.glm(data = data, glmfit = lm1.fit, K = 10)  
cvResult$delta 

This gives a CV prediction error of 23.38074, very close to the prediction error I obtained with my long method above. Am I correct?
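
One way to check that the K-fold estimate itself is not sensitive to the seed is to repeat cv.glm under a few different seeds and compare the first component of delta (the raw CV estimate); a rough sketch, assuming data and lm1.fit from the snippet above:

library(boot)
deltas <- sapply(1:5, function(s) {
  set.seed(s)
  cv.glm(data = data, glmfit = lm1.fit, K = 10)$delta[1]
})
deltas       # five CV error estimates, one per seed
sd(deltas)   # seed-to-seed variation should be small relative to ~23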

Best Answer

The choice of seed should not affect the ultimate result; if it does, you have a problem. Then why do we need a seed at all? Mainly for debugging and troubleshooting.
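
As a minimal R illustration, a seed only pins down the pseudo-random stream so that the same draws can be replayed exactly:

set.seed(123)
sample(1:10, 3)   # some three numbers
set.seed(123)
sample(1:10, 3)   # exactly the same three numbers, so the run can be replayed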

What do I call an ultimate result? Suppose you're analyzing a drug's efficacy and using Monte Carlo simulation to come up with some kind of critical value. The seed shouldn't matter to your conclusion about whether the drug is effective or not.

Consider this: what if you're using R and I'm using Excel? How would I use your seed? I should be able to reproduce the ultimate result regardless of the differences in our software packages and platforms.

Of course, the seed will affect the exact value obtained for the critical value, e.g. 3.1278765876 with one seed instead of 3.128765987 with another, but this is not the ultimate result of your study. Suppose the test statistic is 20.5; then the difference in critical values between two seeds is not material. Actually, you should not even report the critical value beyond the third significant digit; just show 3.13, unless it's for debugging purposes.

However, if somehow the difference in the fourth significant digit makes or breaks it for this drug, then we have an issue. It means either that we can't use this critical value, or that we need to increase the number of simulations to nail down the fourth digit and reduce the variation between seeds to something much smaller.

Therefore, you need to set the other parameters of the simulation in such a way that the seed doesn't matter for the conclusion of your study. Do not "optimize" the seed in any way.
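
As a rough sketch of this in R (the chi-square(1) statistic below is just a stand-in for whatever your simulation produces): estimate the same critical value under two different seeds, once with few simulations and once with many, and watch the seed-to-seed gap shrink.

# Hypothetical illustration: Monte Carlo estimate of a 97.5% critical value
crit <- function(seed, nsim) {
  set.seed(seed)
  quantile(rchisq(nsim, df = 1), 0.975)
}
crit(1, 1e3) - crit(2, 1e3)   # visible gap between seeds with few simulations
crit(1, 1e6) - crit(2, 1e6)   # the gap shrinks as the number of simulations grows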

In my practice I have had some funny projects where the client insisted on reporting all the calculated digits, e.g. $31,456,890.01. The simulations would yield maybe 1% precision, so the number would change in the second or third digit between different seeds. Moreover, the model itself had an accuracy of around 10% at best, so to a physicist I would have reported 3e7. However, the accountants complained that such a number looks "rounded", which drives auditors nuts: it looks to them as if someone is "cooking the books." We ended up reporting the numbers to the cent and storing the seeds so that the numbers would be reproducible upon audit requests.
