[GIS] The stability of randomForest models after increasing predictor variables

land-cover, landsat, r, random-forest, remote-sensing

With reference to the posts: Incorporating terrain data to predict canopy cover using randomForest in R and Random-Forest Classification of 10cm Imagery for species-distribution in R (no point-shapes)

I would like to know how and why a model mapping canopy vs. non-canopy areas would improve if several predictor variables are added, such as:

  • Vegetation indices
  • slope
  • elevation
  • aspect
  • multiple bands

If I do add several predictor variables to the training set, I then have to run the model's prediction over a raster stack consisting of the same layers, right?

Can I therefore also incorporate Landsat bands (after downscaling them to 5.8 m) alongside my LISS IV bands, both in the training data and in the raster stack on which the model will predict?
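For concreteness, here is a minimal sketch of that workflow with the `raster` and `randomForest` packages. All file names, layer names, and the `train` data frame are hypothetical, not taken from the original post:

```r
library(raster)
library(randomForest)

# Stack the predictor layers: LISS IV bands, resampled Landsat band, terrain
covs <- stack("liss_b2.tif", "liss_b3.tif", "liss_b4.tif",
              "landsat_b5_resampled.tif", "slope.tif", "elevation.tif")
names(covs) <- c("b2", "b3", "b4", "ls_b5", "slope", "elev")

# 'train' is a hypothetical data.frame whose predictor columns carry the
# SAME names as the stack layers; 'canopy' is a factor (canopy / non-canopy)
rf <- randomForest(canopy ~ b2 + b3 + b4 + ls_b5 + slope + elev,
                   data = train, ntree = 500)

# raster::predict matches stack layers to model variables by name,
# so the prediction stack must contain every variable used in training
canopy_map <- predict(covs, rf, type = "response")
```

Because matching is by layer name, renaming the stack layers to match the training columns (as above) is the easiest way to guarantee consistency between training and prediction.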

Best Answer

One of the cool things about random forests is that at each node they probe a random subset of the variables. The one providing the split with the best entropy (or another criterion) is kept, while the others are discarded and possibly tested at a subsequent / different node. In very simple words, if a variable does not provide any information about the split (e.g. canopy vs. non-canopy), it will not be used in the final model. From the other point of view, a variable can become informative after a split on another variable, and in that case it will be useful at prediction time.
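To make the split criterion concrete, here is a toy illustration (my own example, not from the answer) of the information gain a candidate split provides on a binary canopy / non-canopy node:

```r
# Binary entropy in bits
entropy <- function(p) ifelse(p %in% c(0, 1), 0, -p * log2(p) - (1 - p) * log2(1 - p))

# Parent node: 50% canopy pixels -> entropy of 1 bit
parent <- entropy(0.5)

# A candidate split sends half the pixels to each child,
# leaving them 90% and 10% canopy respectively
children <- 0.5 * entropy(0.9) + 0.5 * entropy(0.1)

gain <- parent - children   # ~ 0.53 bits
```

A variable whose best split yields a gain near zero at every node it is tested in is simply never chosen, which is why adding uninformative predictors tends to be harmless.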

In principle, add all the information you can (better if you have prior knowledge that it is somehow related to the output of your problem). Landsat bands surely correlate well with the presence of canopy, so add them.

Personally, I use RF classifiers with thousands of variables. In these situations you only have to make sure that you use many trees (in principle, the more the better — the generalization error converges rather than degrading) and that at each node you test roughly sqrt(#variables) candidates in order to explore your feature space properly. Regarding the fraction of training examples used in bagging, I have never observed significant differences in test accuracy; training time can improve, though.
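In the R `randomForest` package these knobs are `ntree`, `mtry`, and `sampsize`. A hedged sketch (the `train` data frame is hypothetical; note that `floor(sqrt(p))` is already the package's default `mtry` for classification, shown explicitly here only to make the advice concrete):

```r
library(randomForest)

# 'train': hypothetical data.frame with a factor column 'canopy' plus predictors
p <- ncol(train) - 1                      # number of predictor variables

rf <- randomForest(canopy ~ ., data = train,
                   ntree    = 1000,            # many trees: OOB error stabilises
                   mtry     = floor(sqrt(p)),  # variables tried at each node
                   sampsize = nrow(train))     # bagging sample size (the default)

print(rf)   # OOB error estimate and confusion matrix
```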

Regarding your second point: yes, all the variables that you use at training time have to be present at test (prediction) time, in the same order and with the same scaling. (Note that the scaling of a variable in itself is not important for random forests, but the distributions of the training and test data must obviously match!)

Regarding literature, the best thing I have ever read about RF is this ("THE") RF tutorial. It is mostly biased towards computer vision applications (in particular the Kinect body-part recognition system they developed), but it is a very easy and pleasant read. After that, you should understand the whole RF thing. For remote sensing applications I don't know; maybe just browse academic journals — a proper search with the right keywords will surely give nice references.

EDIT: you will probably be interested only in the classification part of the above tutorial, but I suggest reading it entirely; it's very nice.