Solved – Do I have to preprocess new data for a prediction if I used preprocessing when building the model?

caret, nnet, r

In this example preprocessing is used to construct a NN:

nnetTune <- train(x = solTrainXtrans, y = solTrainY,
              method = "avNNet",
              tuneGrid = nnetGrid,
              trControl = ctrl,
              preProc = c("center", "scale"),
              linout = TRUE,
              trace = FALSE,
              MaxNWts = 13 * (ncol(solTrainXtrans) + 1) + 13 + 1,
              maxit = 1000,
              allowParallel = FALSE)

If I make predictions with new data, do I have to pre-process this new data or can I directly insert the new data in the model?

Would that be different if I use the model below where data X is preprocessed (centered and scaled) before it is inserted in the nnet?

fit <- nnet(Y~., X, size=12, maxit=500, linout=T, decay=0.01)

Thank you!

Best Answer

Yes, the new data have to be pre-processed as well.

EDIT (based on your last comment):

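For your first code block, the new data are pre-processed automatically. When you pass preProc to train(), caret stores the estimated centering and scaling with the fitted model, and calling predict() on that model applies them to the new data before predicting. So you should supply new data in the same form as the x you passed to train(), without centering and scaling it yourself.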
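One way to check this yourself is to compare two fits on toy data: one that passes preProc to train() and predicts on untransformed new data, and one where the centering and scaling are done explicitly with preProcess(). The sketch below is only an illustration and not part of the original question: the data are simulated, and k-nearest neighbours (method = "knn") is swapped in for avNNet because it is deterministic and sensitive to scaling. If predict() applies the stored pre-processing, the two sets of predictions should agree.

```r
library(caret)

set.seed(1)
# Simulated training data (hypothetical; predictors on very different scales)
train_x <- data.frame(x1 = runif(60, 0, 100), x2 = runif(60, 0, 10))
train_y <- 0.05 * train_x$x1 + 0.5 * train_x$x2 + rnorm(60, sd = 0.1)
new_x   <- data.frame(x1 = c(20, 80), x2 = c(3, 7))

# No resampling and a single fixed k, so both fits are directly comparable
ctrl_none <- trainControl(method = "none")
grid      <- data.frame(k = 5)

# Fit A: pre-processing requested inside train(); new data passed untransformed
fit_a  <- train(x = train_x, y = train_y, method = "knn",
                preProc = c("center", "scale"),
                tuneGrid = grid, trControl = ctrl_none)
pred_a <- predict(fit_a, newdata = new_x)

# Fit B: the same pre-processing done explicitly with preProcess()
pp       <- preProcess(train_x, method = c("center", "scale"))
train_pp <- predict(pp, train_x)
fit_b    <- train(x = train_pp, y = train_y, method = "knn",
                  tuneGrid = grid, trControl = ctrl_none)
pred_b   <- predict(fit_b, newdata = predict(pp, new_x))

# Compare the two sets of predictions
all.equal(unname(pred_a), unname(pred_b))
```

A scale-sensitive model is used on purpose: with a linear model, the predictions would match whether or not the new data were transformed, so the comparison would show nothing.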

For your second code block, yes: nnet() does not provide any functionality to pre-process the data, so you have to apply to the new data the same centering and scaling that you applied to X.

I would recommend using the preProcess() function of caret. In fact, when you pass preProc as an argument to train(), preProcess() is what gets called. You define the kind of pre-processing you need in preProcess(), which estimates the required parameters from the data you give it; calling predict() on the resulting object then actually pre-processes the data in question. The advantage of using preProcess() is that the same object can be reused: pass the new data as the newdata argument of predict(), and they are transformed with the parameters estimated from the training set. Refer to the documentation for more details.

Of course, you can pre-process just a single observation. In your example, you center and scale the training set. This means that you compute the mean and standard deviation of each predictor in the training set, then subtract the mean and divide by the standard deviation, so that each predictor in the transformed training set has a mean of 0 and a standard deviation of 1. To pre-process a single new observation, subtract the same training-set mean and divide by the same training-set standard deviation. Do not recompute these statistics from the new data.
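As a minimal base R sketch of that arithmetic (the numbers and the names train_x and new_obs are made up for illustration), the training means and standard deviations are computed once and then reused for the new observation:

```r
# Toy training predictors (hypothetical values)
train_x <- cbind(a = c(1, 2, 3, 4, 5),
                 b = c(10, 20, 30, 40, 50))

# Statistics are estimated on the training set only
train_mean <- colMeans(train_x)
train_sd   <- apply(train_x, 2, sd)

# Center and scale the training set with its own statistics
train_scaled <- scale(train_x, center = train_mean, scale = train_sd)

# A single new observation is transformed with the SAME training statistics
new_obs    <- cbind(a = 6, b = 0)
new_scaled <- scale(new_obs, center = train_mean, scale = train_sd)

colMeans(train_scaled)      # approximately 0 for each column
apply(train_scaled, 2, sd)  # 1 for each column
new_scaled                  # (6 - mean(a)) / sd(a) and (0 - mean(b)) / sd(b)
```

Note that the transformed new observation does not need to have mean 0 or standard deviation 1; only the training set does.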

As a simple example of how to use the preProcess() function (taken from the documentation):

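library(caret)
data(BloodBrain)

# Estimate the pre-processing (centering and scaling by default) on the training rows only
preProc  <- preProcess(bbbDescr[1:100,-3])
# Apply that same transformation to the training rows and to the test rows
training <- predict(preProc, bbbDescr[1:100,-3])
test     <- predict(preProc, bbbDescr[101:208,-3])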

One last thing. You mention:

"If I do preprocessing by myself with the testdata and use that preprocessed testdata as input for fitting the nnet..."

Just to make this clear: you should fit the model on the training data, and then use the predict() function to generate predictions for new data.
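For the plain nnet() case, that order of operations could look roughly like the sketch below. It is an illustration under stated assumptions, not the asker's actual setup: the data are simulated, the names X, Y and new_X are placeholders, and the network size and iteration count are arbitrary.

```r
library(nnet)

set.seed(1)
# Simulated training data (hypothetical)
X <- data.frame(x1 = runif(50, 0, 100), x2 = runif(50, 0, 10))
Y <- 0.05 * X$x1 + 0.5 * X$x2 + rnorm(50, sd = 0.1)

# 1. Estimate centering and scaling on the training predictors only
X_scaled <- scale(X)
ctr <- attr(X_scaled, "scaled:center")
scl <- attr(X_scaled, "scaled:scale")

# 2. Fit the network on the pre-processed training data
train_df <- data.frame(Y = Y, X_scaled)
fit <- nnet(Y ~ ., data = train_df, size = 3, maxit = 200,
            linout = TRUE, decay = 0.01, trace = FALSE)

# 3. Transform the new data with the TRAINING statistics, then predict
new_X      <- data.frame(x1 = c(20, 80), x2 = c(3, 7))
new_scaled <- as.data.frame(scale(new_X, center = ctr, scale = scl))
pred       <- predict(fit, newdata = new_scaled)
```

The test data are never used for fitting here; they are only transformed and passed to predict().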

Hope it helps!
