Solved – R caret package and dummy variables

Tags: caret, categorical data, r

I've been trying to run boosted regression tree modelling on spatial data using the caret package in R. My predictor variables were all extracted from environmental raster files, e.g. soil type and landcover. Since these two variables are actually factors (although their codes are numeric), I have been creating dummy variables for them before running the train function.

This all works well, except when I want to predict over larger areas. For prediction I use a raster stack containing all my predictor rasters, but of course I do not have rasters for the dummy variables. Do I really need to create dummy rasters as well, or is there a way around this?

I have tried running without dummy variables, specifying these predictors as factors instead. The train function then seems to convert them to dummy variables automatically, but I again run into problems when predicting with my raster stack.

I have tried running regular boosted regression tree models using gbm alone (gbm.step), and I can easily get the predictions to work using rasterstack – but I would like to use the caret train function, so I can run some K-fold cross validation.

I hope someone can help me.
Thanks,
Lene Jung Kjær

Best Answer

I have experience with GBM using caret. I found that I can pass factor variables to caret's train with GBM without encoding them myself, but when I inspected the structure of the fitted trees, I found that the factor variables had been one-hot encoded automatically inside the function.

As a result, using the trained model for prediction without applying the same encoding to the new data always failed.
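
To illustrate the point about matching encodings, here is a minimal base-R sketch, not the asker's actual workflow. The tables `train_df` and `new_df`, their column names, and the helper `encode()` are all made up for illustration. The idea is to fix the factor levels once from the training data and then build the dummy columns for both the training table and the prediction table (for example, the cell values of a raster stack pulled into a data frame) with exactly the same levels, so both end up with identical columns.

```r
# Hypothetical training table: class codes are numeric, as extracted from rasters.
train_df <- data.frame(
  elevation = c(12.5, 40.1, 33.0, 8.2),
  soil      = c(1, 3, 2, 3),
  landcover = c(10, 20, 10, 30)
)

# Fix the level sets once, from the training data.
soil_levels      <- sort(unique(train_df$soil))
landcover_levels <- sort(unique(train_df$landcover))

# Hypothetical helper: one-hot encode any table using the training levels,
# so the dummy columns always have the same names and order.
encode <- function(df) {
  df$soil      <- factor(df$soil, levels = soil_levels)
  df$landcover <- factor(df$landcover, levels = landcover_levels)
  mm <- model.matrix(
    ~ elevation + soil + landcover,
    data = df,
    # contrasts = FALSE gives one 0/1 column per class (no reference level dropped).
    contrasts.arg = list(
      soil      = contrasts(df$soil, contrasts = FALSE),
      landcover = contrasts(df$landcover, contrasts = FALSE)
    )
  )
  # Drop the intercept column added by model.matrix().
  as.data.frame(mm[, -1, drop = FALSE])
}

train_x <- encode(train_df)

# Hypothetical prediction cells. Not every class occurs here,
# but the encoded columns still match the training table.
new_df <- data.frame(
  elevation = c(20.0, 5.5),
  soil      = c(2, 2),
  landcover = c(30, 10)
)
new_x <- encode(new_df)

print(names(train_x))
print(identical(names(train_x), names(new_x)))
```

A model trained on `train_x` can then be given `new_x` at prediction time, since both have the same dummy columns. In caret, `dummyVars()` supports a similar pattern: build the encoder on the training data and apply it to new data with `predict()`. I have not tested how this interacts with the asker's raster stack, so treat that part as an assumption.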