Solved – Where in R code should I use set.seed() function (specifically, before shuffling or after)

neural-networks, r, random-generation

I've been using the set.seed() function to reproduce the same results across multiple runs. However, I don't understand where to call it.
I'm asking because calling it before shuffling the data produces different results than calling it after shuffling.

In my case, I first called set.seed() after the shuffle:

library(nnet)  # provides nnet()
data <- read.csv("F:/winequality-white.csv")
# shuffle the rows (this sample() call runs before the seed is set)
data <- data[sample(1:nrow(data)),]
set.seed(123)
data$quality <- as.factor(data$quality)
div <- 0.7
index <- 1:nrow(data)
# draw 70% of the row indices for the training set
position <- sample(index, trunc(nrow(data) * div))
test <- data[-position,]
train <- data[position,]
result = data[nrow(test),]
result$pred = -1
model_nnet <- nnet(as.factor(quality) ~ ., data=train, size=10, maxit=1000)
pred <- predict(model_nnet, test, type="class")
table(true=test$quality, predicted=pred)

It gives the result

    predicted
true   3   4   5   6   7   9
   3   0   1   3   1   0   0
   4   0   5  27  17   1   0
   5   1   6 259 155   3   0
   6   0   3 152 431  79   2
   7   0   0   5 164  98   1
   8   0   0   1  26  25   0
   9   0   0   0   2   2   0

And on running with set.seed() before the shuffle:

library(nnet)  # provides nnet()
data <- read.csv("F:/winequality-white.csv")
set.seed(123)
# shuffle the rows (this sample() call runs after the seed is set)
data <- data[sample(1:nrow(data)),]
data$quality <- as.factor(data$quality)
div <- 0.7
index <- 1:nrow(data)
# draw 70% of the row indices for the training set
position <- sample(index, trunc(nrow(data) * div))
test <- data[-position,]
train <- data[position,]
result = data[nrow(test),]
result$pred = -1
model_nnet <- nnet(as.factor(quality) ~ ., data=train, size=10, maxit=1000)
pred <- predict(model_nnet, test, type="class")
table(true=test$quality, predicted=pred)

produces the result

    predicted
true   4   5   6   7
   3   1   5   1   0
   4   7  27  15   1
   5   3 234 179   4
   6   2 135 480  57
   7   1  10 169  92
   8   0   0  33  13
   9   0   0   0   1

As you can see, the results and accuracy are different. I just want to know where exactly to call set.seed() to produce the best results: in the above case, either before or after I shuffle the data.
Thanks in advance.

Best Answer

You use set.seed to make your results reproducible. You therefore have to call it before you generate any random values. For example:

> set.seed(1)
> sample(c(1,2,3,4,5,6,7,8,9,10),4)
[1] 3 4 5 7
> sample(c(1,2,3,4,5,6,7,8,9,10),4)
[1] 3 9 8 5

If you do the same again, you get the same numbers.

> set.seed(1)
> sample(c(1,2,3,4,5,6,7,8,9,10),4)
[1] 3 4 5 7

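If you run your code again, the version that calls set.seed() before the shuffle gives the same output every time, while the version that calls it after the shuffle gives a different output, because the shuffle itself is not covered by the seed.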
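
As a rough sketch of that difference, the shuffle and split can be wrapped in a small helper and run twice with each seed placement. The helper name split_data, its seed_before_shuffle argument, and the use of the built-in iris data in place of the wine CSV are assumptions made for illustration; they are not part of the question's code.

```r
# Hypothetical helper (not from the question): shuffle a data frame and
# split it into training and test sets.
split_data <- function(df, seed, seed_before_shuffle = TRUE, div = 0.7) {
  if (seed_before_shuffle) set.seed(seed)   # seed covers shuffle and split
  df <- df[sample(nrow(df)), ]              # shuffle the rows
  if (!seed_before_shuffle) set.seed(seed)  # seed covers the split only
  position <- sample(nrow(df), trunc(nrow(df) * div))
  list(train = df[position, ], test = df[-position, ], position = position)
}

# Seed before the shuffle: both runs produce the same training set.
run1 <- split_data(iris, seed = 123)
run2 <- split_data(iris, seed = 123)
identical(run1$train, run2$train)  # TRUE

# Seed after the shuffle: the drawn positions match, but they index into
# a data frame whose row order came from the unseeded shuffle.
run3 <- split_data(iris, seed = 123, seed_before_shuffle = FALSE)
run4 <- split_data(iris, seed = 123, seed_before_shuffle = FALSE)
identical(run3$position, run4$position)  # TRUE
```

Note that nnet() also draws random initial weights by default, so the seed has to be set before the model is fitted as well as before the shuffle and split.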

EDIT: To make it clear: set.seed initializes your random number generator.
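
One consequence, sketched below under the assumption of R's default generator settings: every call that draws random numbers advances the generator's state, so any random call placed between set.seed() and the code of interest changes what that code draws. The variable names are illustrative only.

```r
# Reseeding and repeating the same call reproduces the same draw.
set.seed(1)
first_draw <- sample(1:10, 4)

set.seed(1)
repeat_draw <- sample(1:10, 4)
identical(first_draw, repeat_draw)  # TRUE

# An extra random call after the seed advances the generator state,
# so the following sample() starts from a different point in the stream.
set.seed(1)
ignored <- runif(1)
shifted_draw <- sample(1:10, 4)
```

Here shifted_draw is not guaranteed to differ from first_draw, but it is no longer tied to it, which is why the seed should be set immediately before the first random step you want to reproduce.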
