R Neural Networks – How to Train and Validate a Neural Network Model in R?

Tags: neural-networks, r

I am new to modeling with neural networks, but I managed to fit a neural network, using all available data points, that matches the observed data well. The model was built in R with the nnet package:

require(nnet)
## 33.80 is the highest observed DOC value; dividing by it scales the
## response into [0, 1] for nnet's logistic output unit
mynnet.fit <- nnet(DOC/33.80 ~ ., data = MyData, size = 6, decay = 0.1, maxit = 1000)
mynnet.predict <- predict(mynnet.fit) * 33.80
mean((mynnet.predict - MyData$DOC)^2)  ## mean squared error was 16.5

The data I am analyzing looks as follows, where the DOC is the variable that has to be modeled (there are about 17,000 observations):

      Q  GW_level Temp   t_sum   DOC
1 0.045    0.070 12.50     0.2 11.17
2 0.046    0.070 12.61     0.4 11.09
3 0.046    0.068 12.66     2.8 11.16
4 0.047    0.050 12.66     0.4 11.28
5 0.049    0.050 12.55     0.6 11.45
6 0.050    0.048 12.45     0.4 11.48

Now, I have read that the model should be trained on 70% of the data points and validated on the remaining 30%. How do I do this? Which functions do I have to use?

I used the train function from the caret package to calculate the parameters for size and decay.

require(caret)
my.grid <- expand.grid(.decay = c(0.5, 0.1), .size = c(5, 6, 7))
mynnetfit <- train(DOC/33.80 ~ ., data = MyData, method = "nnet", maxit = 100,
                   tuneGrid = my.grid, trace = FALSE)

Any direct help or linkage to other websites/posts is greatly appreciated.

Best Answer

Max Kuhn's caret Manual - Model Building is a great starting point.

I would think of the validation stage as occurring within the caret train() call, since it chooses your hyperparameters (decay and size) via bootstrapping or some other resampling scheme that you can specify through the trControl parameter. I call the data set I use for characterizing the error of the final chosen model my test set. Since caret handles the hyperparameter selection for you, you only need a training set and a test set.

You can use the createDataPartition() function in caret to split your data set into training and test sets. I tested this using the Prestige data set from the car package, which has information about income as related to level of education and occupational prestige:

library(car)    # for the Prestige data set
library(caret)
trainIndex <- createDataPartition(Prestige$income, p = 0.7, list = FALSE)
prestige.train <- Prestige[trainIndex, ]
prestige.test <- Prestige[-trainIndex, ]

The createDataPartition() function is a little misnamed: it doesn't create the partition for you, but rather returns a vector of row indices that you then use to construct the training and test sets. It's easy enough to do this yourself in R with sample(), but one thing createDataPartition() does do is sample from within factor levels; if your outcome is categorical, its distribution is maintained across the partitions. That isn't relevant here, though, since your outcome is continuous.
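For comparison, a plain 70/30 split with base R's sample() might look like the sketch below. It uses a made-up data frame purely to illustrate the indexing, and unlike createDataPartition() it does not sample within factor levels:

```r
set.seed(42)  # make the random split reproducible
mydata <- data.frame(x = rnorm(100), y = rnorm(100))

## Draw 70% of the row indices at random for the training set
n <- nrow(mydata)
train.idx <- sample(seq_len(n), size = floor(0.7 * n))

train.set <- mydata[train.idx, ]   # 70 rows
test.set  <- mydata[-train.idx, ]  # the remaining 30 rows
```

The negative-index trick (`mydata[-train.idx, ]`) is the same one used with createDataPartition() above: everything not drawn for training lands in the test set.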

Now you can train your model on the training set:

my.grid <- expand.grid(.decay = c(0.5, 0.1), .size = c(5, 6, 7))
prestige.fit <- train(income ~ prestige + education, data = prestige.train,
    method = "nnet", maxit = 1000, tuneGrid = my.grid, trace = FALSE, linout = TRUE)

Aside: I had to add the linout parameter to get nnet to work with a regression (vs. classification) problem. Otherwise I got all 1s as predicted values from the model.
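Another knob worth knowing: train() resamples via the bootstrap by default, but you can request a different scheme through trControl. Here is a sketch using 10-fold cross-validation; wiring it into the train() call above is shown in the comment:

```r
library(caret)

## Ask caret for 10-fold cross-validation instead of the default bootstrap
fit.control <- trainControl(method = "cv", number = 10)

## Pass it to train() alongside the tuning grid, e.g.:
## train(income ~ prestige + education, data = prestige.train,
##       method = "nnet", maxit = 1000, tuneGrid = my.grid,
##       trControl = fit.control, trace = FALSE, linout = TRUE)
```

The cross-validated performance for each (size, decay) combination then drives the hyperparameter selection, exactly as the bootstrap estimates do by default.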

You can then call predict on the fit object using the test data set and calculate RMSE from the results:

prestige.predict <- predict(prestige.fit, newdata = prestige.test)
prestige.rmse <- sqrt(mean((prestige.predict - prestige.test$income)^2)) 