Solved – R caret package and dummy variables

caret, categorical data, r

I've been trying to run boosted regression tree models on spatial data using the caret package in R. My predictor variables were all extracted from environmental raster files, e.g. soil type and land cover. Since these two variables are actually factors (although their codes are numeric), I have been creating dummy variables for them before running the train function.

This all works well, except when I want to predict to larger areas. When predicting I am using a rasterstack with all my predictor rasters, but of course I do not have rasters for all the dummy variables. Do I really need to create dummy rasters as well, or is there a way around this?

I have tried running without dummy variables, but specifying these predictors as factors – the train function then seems to automatically convert these to dummy variables, but I again run into problems when predicting with my raster stack.

I have tried running regular boosted regression tree models using gbm alone (gbm.step), and I can easily get the predictions to work using rasterstack – but I would like to use the caret train function, so I can run some K-fold cross validation.

I hope someone can help me.
Thanks,
Lene Jung Kjær

Best Answer

I have experience with GBM using caret. I found that I can pass factor variables to caret's train with GBM without encoding them, but when I inspected the structure of the resulting trees I found that the factor variables had been one-hot encoded automatically inside the function.

For this reason, using the trained model for prediction without applying the same encoding to the new data always failed.
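A minimal sketch of why the encoding has to match, using only base R rather than caret itself. Here model.matrix stands in for the one-hot encoding step, and the column names soil and elev and all data values are made up for illustration. The idea is that new data (for example, cell values pulled out of a raster stack) gets its factor rebuilt with the training levels, so the encoded columns line up with what the model saw.

```r
# Toy training data: soil type is stored as numeric codes but is categorical.
# All names and values here are hypothetical.
train_df <- data.frame(
  soil = c(1, 2, 3, 1, 2, 3),
  elev = c(10, 20, 30, 40, 50, 60)
)

# Record the levels seen in training and convert the codes to a factor.
soil_levels <- sort(unique(train_df$soil))
train_df$soil <- factor(train_df$soil, levels = soil_levels)

# One-hot encode; dropping the intercept gives every level its own column.
train_x <- model.matrix(~ soil + elev - 1, data = train_df)

# New cells to predict on; soil code 2 does not occur here.
new_df <- data.frame(soil = c(3, 1), elev = c(15, 25))

# Reuse the training levels so the same dummy columns are created.
new_df$soil <- factor(new_df$soil, levels = soil_levels)
new_x <- model.matrix(~ soil + elev - 1, data = new_df)

# The two design matrices now have identical columns.
stopifnot(identical(colnames(train_x), colnames(new_x)))
print(colnames(new_x))
```

A level that is missing from the new data still gets its column (filled with zeros), which is what keeps the prediction matrix compatible with the trained model.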

Related Solutions

Solved – Regression using dummy variables

I am not a big fan of converting a continuous variable to multiple dummy variables. I guess the binning procedure is considered standard practice in score card development.

Regarding dummy variable insignificance: when you add dummy variables to a regression, the omitted group acts as the reference group, and each dummy variable compares its group with that reference. When a variable has a nonlinear (e.g. quadratic) relationship with the log-odds, some dummy variables may come out insignificant, namely those for groups whose effect is close to that of the reference group. My suggestion is to look at the pattern of log-odds in each bin before merging. Depending on the pattern, you can either make fewer final bins or change the reference group. I know this is a bit abstract, but I cannot be more specific without knowing the case.
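To make the reference-group point concrete, here is a small sketch on simulated data. The bin names (low, mid, high), the sample size and the true log-odds are all assumptions chosen so that low and high behave alike while mid differs, i.e. a roughly quadratic pattern.

```r
set.seed(42)

# Simulated data: three bins of a binned continuous variable (hypothetical).
n <- 300
d <- data.frame(
  bin = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
               levels = c("low", "mid", "high"))
)

# Assumed true log-odds: low and high are alike, mid is different.
true_eta <- c(low = -1, mid = 1, high = -1)
d$y <- rbinom(n, 1, plogis(true_eta[as.character(d$bin)]))

# Step 1: look at the empirical log-odds in each bin before merging anything.
log_odds <- sapply(split(d$y, d$bin), function(v) qlogis(mean(v)))
print(log_odds)

# Step 2: fit with "low" as the reference; the "high" dummy compares
# two similar groups, so its coefficient is expected to be small.
fit_low <- glm(y ~ bin, family = binomial, data = d)
print(summary(fit_low)$coefficients)

# Step 3: change the reference group to "mid"; both dummies now
# compare against the bin that differs.
d$bin_mid_ref <- relevel(d$bin, ref = "mid")
fit_mid <- glm(y ~ bin_mid_ref, family = binomial, data = d)
print(summary(fit_mid)$coefficients)
```

Both fits describe the same model (the deviance is unchanged); only the comparisons that the individual coefficients and their p-values express are different.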

You could also drop the insignificant dummy variable. Doing so merges the group associated with the dropped dummy into the reference group. This may not be appropriate if merging the reference group with the (insignificant) dummy group doesn't make business sense.
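The equivalence between dropping a dummy and merging its group into the reference can be checked directly on the design matrix. This is only an illustration with made-up bin labels (low, mid, high), with low as the reference group:

```r
# Hypothetical binned variable with "low" as the reference level.
bin <- factor(c("low", "mid", "high", "low", "mid", "high"),
              levels = c("low", "mid", "high"))

# Full dummy coding: columns (Intercept), binmid, binhigh.
X <- model.matrix(~ bin)

# Option A: drop the "high" dummy from the design matrix.
X_dropped <- X[, c("(Intercept)", "binmid")]

# Option B: merge "high" into the reference level "low", then re-encode.
merged <- bin
levels(merged)[levels(merged) == "high"] <- "low"
X_merged <- model.matrix(~ merged)

# Both routes give the same predictor columns (only the names differ).
stopifnot(all(X_dropped[, "binmid"] == X_merged[, "mergedmid"]))
print(levels(merged))
```

Doing the merge explicitly on the factor, rather than silently dropping a column, makes it easier to see which groups the model now treats as one.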

Solved – Understanding the output of C5.0 classification model using the CARET package

Thanks for the plug =]

1) The winnowing process is erroneously removing predictors that could improve the accuracy of the model. Within the cross-validation loop, the winnowing process thinks that it is improving the accuracy, but that does not hold up once other samples are used to evaluate performance. Sometimes it helps and other times it doesn't.

2) There is no graph of the tree yet (but it is on my list). Try using the summary function:

> set.seed(1)
> mod <- train(Species ~ ., data = iris, method = "C5.0")
> ## This data set liked rules over trees but it works the same for trees
> summary(mod$finalModel)

Call:
<snip>
-----  Trial 0:  -----

Rules:

Rule 0/1: (50, lift 2.9)
    Petal.Length <= 1.9
    ->  class setosa  [0.981]

Rule 0/2: (48/1, lift 2.9)
    Petal.Length > 1.9
    Petal.Length <= 4.9
    Petal.Width <= 1.7
    ->  class versicolor  [0.960]
<snip>
Evaluation on training data (150 cases):

Trial           Rules     
-----     ----------------
      No           Errors

   0         4    4( 2.7%)
   1         5    8( 5.3%)
   2         3    6( 4.0%)
   3         6   12( 8.0%)
   4         4    5( 3.3%)
   5         7    3( 2.0%)
   6         3    8( 5.3%)
   7         8   15(10.0%)
   8         4    3( 2.0%)
   9         5    5( 3.3%)
boost             0( 0.0%)   <<


   (a)   (b)   (c)    <-classified as
  ----  ----  ----
    50                (a): class setosa
          50          (b): class versicolor
                50    (c): class virginica


Attribute usage:

100.00% Petal.Length
 66.67% Petal.Width
 54.00% Sepal.Width
 46.67% Sepal.Length


Time: 0.0 secs

HTH,

Max