One way to think about the process of building a predictive model (such as a neural network) is that you have a 'budget' of information to spend, much like the fixed amount of money in a monthly household budget. With only 87 observations in your training set (and only 36 more in your test set), you have a very skimpy budget. In addition, a binary indicator (i.e., your predicted variable is positive vs. negative) carries much less information than a continuous variable. In truth, you may only have enough information to reliably estimate the proportion of positives.
Neural networks have many advantages, but they require very large numbers of parameters to be estimated. When you have a hidden layer (or more than one hidden layer), and multiple input variables, the number of parameters (link weights) that need to be accurately estimated explodes. But every parameter to be estimated consumes some of your informational budget. You are essentially guaranteed to overfit this model (note that this has nothing to do with the computational feasibility of the algorithm). Unfortunately, I don't think cross-validation will get you out of these problems.
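To make the explosion concrete, here is a rough back-of-the-envelope count for a small, hypothetical network (the layer sizes are made up purely for illustration):
# Hypothetical: 10 inputs, one hidden layer of 5 units, 1 output unit
n_in <- 10; n_hid <- 5; n_out <- 1
(n_in + 1) * n_hid + (n_hid + 1) * n_out  # the +1 terms are bias weights
# 61 weights to estimate -- already close to your 87 observations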
If you are committed to building a predictive model using your continuous variables, I would try a logistic regression model instead of a NN. It will use fewer parameters. I would fit the model with probably only one variable, or at most a couple, and use the test set to see if the additional variables (beyond the intercept only) create instability and reduce your out of sample accuracy.
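A minimal sketch of that comparison, assuming hypothetical data frames trainSet and testSet with a 0/1 outcome y and a continuous predictor x1 (substitute your own names):
# Fit on the training set with a single predictor
fit <- glm(y ~ x1, family = binomial, data = trainSet)
# Out-of-sample accuracy on the test set
probs <- predict(fit, newdata = testSet, type = "response")
mean((probs > 0.5) == testSet$y)
Refitting with y ~ 1 and then with one more predictor, and comparing the same test-set accuracy each time, shows whether each added variable earns its keep.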
Regarding the X variables themselves, I would use a method that is blind to the outcome. Specifically, I would try principal components analysis (PCA) and extract just the first one or two PCs. I honestly think this is going to be the best you are going to be able to do.
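A sketch of that approach, reusing the hypothetical trainSet/testSet from above with the continuous predictors in columns xCols:
# PCA is fitted on the training predictors only, blind to the outcome
pc <- prcomp(trainSet[, xCols], center = TRUE, scale. = TRUE)
trainPCs <- pc$x[, 1:2]                                     # first two PCs
testPCs  <- predict(pc, newdata = testSet[, xCols])[, 1:2]  # same rotation
The test set is projected with the rotation learned on the training set, so the outcome never influences the feature construction.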
I think the question, while somewhat trivial and "programmatic" at first read, touches upon two main issues that are very important in modern Statistics:
- reproducibility of results and
- non-deterministic algorithms.
The reason for the different results is that the two procedures are trained using different random seeds. A random forest uses a random subset of the full dataset's variables as candidates at each split (that is the mtry argument, which relates to the random subspace method) and also bags (bootstrap aggregates) the original dataset to decrease the variance of the model. These two internal random sampling procedures, though, are not deterministic between different runs of the algorithm. The random order in which the sampling is done is controlled by the random seeds used.
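To see this non-determinism directly, here is a minimal sketch using the built-in iris data:
library(randomForest)
# Two unseeded fits: the bootstrap and mtry draws differ between runs
rf1 <- randomForest(Species ~ ., data = iris, ntree = 50)
rf2 <- randomForest(Species ~ ., data = iris, ntree = 50)
identical(rf1$votes, rf2$votes)  # FALSE (almost surely)
# Setting the same seed before each fit makes the runs identical
set.seed(99); rf3 <- randomForest(Species ~ ., data = iris, ntree = 50)
set.seed(99); rf4 <- randomForest(Species ~ ., data = iris, ntree = 50)
identical(rf3$votes, rf4$votes)  # TRUE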
If the same seeds were used, one would get the exact same results in both cases where the randomForest routine is called, both internally in caret::train and externally when fitting a random forest manually. I attach a simple code snippet to showcase this. Please note that I use a very small number of trees (argument: ntree) to keep training fast; in practice it should generally be much larger.
library(caret)
library(randomForest)  # needed for the manual randomForest() fit below

# Simulate training and test data
set.seed(321)
trainData <- twoClassSim(5000, linearVars = 3, noiseVars = 9)
testData  <- twoClassSim(5000, linearVars = 3, noiseVars = 9)

# One seed vector per resample (5 folds x 5 repeats = 25) plus one
# final seed used when fitting the last model on the full training set
set.seed(432)
mySeeds <- sapply(simplify = FALSE, 1:26, function(u) sample(10^4, 3))

cvCtrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5,
                       classProbs = TRUE, summaryFunction = twoClassSummary,
                       seeds = mySeeds)

fitRFcaret <- train(Class ~ ., data = trainData, trControl = cvCtrl,
                    ntree = 33, method = "rf", metric = "ROC")

# Reuse the seed caret used for its final model, then fit manually
set.seed(unlist(tail(mySeeds, 1))[1])
fitRFmanual <- randomForest(Class ~ ., data = trainData,
                            mtry = fitRFcaret$bestTune$mtry, ntree = 33)
At this point both the caret::train object fitRFcaret and the manually defined randomForest object fitRFmanual have been trained using the same data and, importantly, the same random seeds when fitting their final model. As such, when we predict using these objects, and because we do no preprocessing of our data, we will get the exact same answers.
all.equal(current = as.vector(predict(fitRFcaret, testData)),
          target = as.vector(predict(fitRFmanual, testData)))
# TRUE
Just to clarify this latter point a bit further: predict(xx$finalModel, testData) and predict(xx, testData) will be different if one sets the preProcess option when using train. When using the finalModel directly, the predict method of the fitted model (predict.randomForest here) is called instead of predict.train, and no pre-processing takes place. Obviously, in the scenario outlined in the original question where no pre-processing is done, the results will be the same whether one uses the finalModel, the manually fitted randomForest object, or the caret::train object.
all.equal(current = as.vector(predict(fitRFcaret$finalModel, testData)),
          target = as.vector(predict(fitRFmanual, testData)))
# TRUE
all.equal(current = as.vector(predict(fitRFcaret$finalModel, testData)),
          target = as.vector(predict(fitRFcaret, testData)))
# TRUE
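For contrast, a quick sketch of the diverging preProcess case; the centering/scaling choice and the fixed mtry value here are arbitrary illustration values:
# With preProcess set, the two predict paths no longer agree
fitPP <- train(Class ~ ., data = trainData, method = "rf", ntree = 33,
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "none"),
               tuneGrid = data.frame(mtry = 2))
all.equal(current = as.vector(predict(fitPP$finalModel, testData)),
          target = as.vector(predict(fitPP, testData)))
# not TRUE: predict.train centers and scales testData first, while the
# finalModel is handed the raw values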
I would strongly suggest that you always set the random seed used by R, MATLAB, or any other program you use. Otherwise you cannot check the reproducibility of results (which, OK, might not be the end of the world) nor exclude a bug or external factor affecting the performance of a modelling procedure (which, yeah, kind of sucks). A lot of the leading ML algorithms (e.g. gradient boosting, random forests, neural networks) do employ internal resampling procedures during their training phases, so setting the random seed states prior to (or sometimes even within) their training phase can be important.
Best Answer
Linear regressions are great because you can implement prediction very simply in any program that can multiply and add.
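For example (variable names hypothetical), the whole "model" is just a handful of coefficients:
fit <- lm(y ~ x1 + x2, data = trainSet)
coef(fit)  # three numbers you can copy into any other system
# prediction there is simply: pred = b0 + b1 * x1 + b2 * x2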
Random forests, on the other hand, are much more complicated. They are made up of individual decision trees, each of which can basically be represented by a set of rules. However, a random forest may have hundreds or thousands of individual trees, which would be very tedious to implement by hand in another system.
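To get a feel for the scale of the problem, you can inspect a single tree of a fitted forest; a small sketch using the built-in iris data:
library(randomForest)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
head(getTree(rf, k = 1, labelVar = TRUE))  # the split rules of tree number 1
rf$ntree  # ...and 499 more trees like it to translate by hand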
Your best bet is probably going to be to find a random forest implementation for the system you wish to export the model to, and then use PMML to export the model. The pmml package will let you convert your random forest to an XML file, which you should be able to import into any system that supports PMML.
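A minimal sketch of that export route, assuming the pmml package (which provides a pmml() method for randomForest objects) and the fitted forest rf from above:
library(pmml)
library(XML)  # for saveXML()
saveXML(pmml(rf), file = "rf_model.pmml")  # PMML is just XML on disk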