[GIS] The stability of randomForest models after increasing predictor variables

land-cover, landsat, r, random-forest, remote-sensing

With reference to the posts: Incorporating terrain data to predict canopy cover using randomForest in R and Random-Forest Classification of 10cm Imagery for species-distribution in R (no point-shapes)

I would like to know how and why a model mapping canopy vs. non-canopy areas would improve if several predictor variables are added, such as:

  • Vegetation indices
  • slope
  • elevation
  • aspect
  • multiple bands

If I do add several predictor variables to the training set, I then have to run the model's prediction over a raster stack consisting of the same layers, right?

Can I therefore also incorporate Landsat bands (after downscaling them to 5.8 m) alongside my LISS IV bands, both in the training data and in the raster stack on which the model will predict?
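For concreteness, here is a minimal sketch of that workflow with the `raster` and `randomForest` packages. All file names, layer names, and the `train` data frame are hypothetical, not taken from the original post:

```r
library(raster)
library(randomForest)

# Stack the predictor layers: LISS IV bands, resampled Landsat band, terrain
covs <- stack("liss_b2.tif", "liss_b3.tif", "liss_b4.tif",
              "landsat_b5_resampled.tif", "slope.tif", "elevation.tif")
names(covs) <- c("b2", "b3", "b4", "ls_b5", "slope", "elev")

# 'train' is a hypothetical data.frame whose predictor columns carry the
# SAME names as the stack layers; 'canopy' is a factor (canopy / non-canopy)
rf <- randomForest(canopy ~ b2 + b3 + b4 + ls_b5 + slope + elev,
                   data = train, ntree = 500)

# raster::predict matches stack layers to model variables by name,
# so the prediction stack must contain every variable used in training
canopy_map <- predict(covs, rf, type = "response")
```

Because matching is by layer name, renaming the stack layers to match the training columns (as above) is the easiest way to guarantee consistency between training and prediction.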

Best Answer

One of the cool things about random forests is that at each node they probe a random subset of the variables. The one providing the split with the best entropy (or another criterion) is kept, while the others are discarded and possibly tested at a subsequent / different node. In very simple words, if a variable does not provide any information about the split (e.g. canopy vs. non-canopy), it will not be used in the final model. From the other point of view, a variable can become informative after a split on another variable, and in that case it will be useful at prediction time.
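To make the split criterion concrete, here is a toy illustration (my own example, not from the answer) of the information gain a candidate split provides on a binary canopy / non-canopy node:

```r
# Binary entropy in bits
entropy <- function(p) ifelse(p %in% c(0, 1), 0, -p * log2(p) - (1 - p) * log2(1 - p))

# Parent node: 50% canopy pixels -> entropy of 1 bit
parent <- entropy(0.5)

# A candidate split sends half the pixels to each child,
# leaving them 90% and 10% canopy respectively
children <- 0.5 * entropy(0.9) + 0.5 * entropy(0.1)

gain <- parent - children   # ~ 0.53 bits
```

A variable whose best split yields a gain near zero at every node it is tested in is simply never chosen, which is why adding uninformative predictors tends to be harmless.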

In principle, add all the information you can (better if you have prior knowledge that it is somehow related to the output of your problem). Landsat bands surely correlate well with the presence of canopy, so add them.

Personally, I use RF classifiers with thousands of variables. In these situations you only have to make sure that you use many trees (in principle, the more the better — the generalization error converges rather than degrading) and that at each node you test roughly sqrt(#variables) candidates in order to explore your feature space properly. Regarding the fraction of training examples used in bagging, I have never observed significant differences in test accuracy; training time can improve, though.
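In the R `randomForest` package these knobs are `ntree`, `mtry`, and `sampsize`. A hedged sketch (the `train` data frame is hypothetical; note that `floor(sqrt(p))` is already the package's default `mtry` for classification, shown explicitly here only to make the advice concrete):

```r
library(randomForest)

# 'train': hypothetical data.frame with a factor column 'canopy' plus predictors
p <- ncol(train) - 1                      # number of predictor variables

rf <- randomForest(canopy ~ ., data = train,
                   ntree    = 1000,            # many trees: OOB error stabilises
                   mtry     = floor(sqrt(p)),  # variables tried at each node
                   sampsize = nrow(train))     # bagging sample size (the default)

print(rf)   # OOB error estimate and confusion matrix
```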

Regarding your second point: yes, all the variables that you use at training time have to be present at test (prediction) time, in the same order and with the same scaling. (Note that the scaling of a variable in itself is not important for random forests, but the distributions of the training and test data must obviously match!)

Regarding literature, the best thing I have ever read about RF is this ("THE") RF tutorial. It is mostly biased towards computer vision applications (in particular the Kinect body-part recognition system they developed), but it is a very easy and pleasant read. After that, you should understand the whole RF thing. For remote sensing applications I don't know; maybe just browse academic journals — a proper search with the right keywords will surely give nice references.

EDIT: you will probably be interested only in the classification part of the above tutorial, but I suggest reading it entirely; it's very nice.