Solved – Understanding the output of a C5.0 classification model using the caret package

Tags: caret, classification, feature selection, machine learning, r

I fit a C5.0 classification model to a 4-class problem ($N_{train} = 165$, $P = 11$) with the caret R package, using the code below. The model was tuned over the winnowing option, which is a kind of feature selection approach. Here is an excerpt on winnowing from caret's companion book, in my opinion a must-have for discovering the hidden gems coded in the package:
Kuhn M, Johnson K. Applied Predictive Modeling. 1st ed. New York: Springer; 2013.

C5.0 also has an option to winnow or remove predictors: an initial algorithm uncovers which predictors have a relationship with the outcome, and the final model is created from only the important predictors. To do this, the training set is randomly split in half and a tree is created for the purpose of evaluating the utility of the predictors (call this the “winnowing tree”). Two procedures characterize the importance of each predictor to the model:

1. Predictors are considered unimportant if they are not in any split in the winnowing tree.
2. The half of the training set samples not included to create the winnowing tree are used to estimate the error rate of the tree. The error rate is also estimated without each predictor and compared to the error rate when all the predictors are used. If the error rate improves without the predictor, it is deemed to be irrelevant and is provisionally removed.
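(As an aside, the winnowing pre-pass can also be switched on directly in the C50 package, outside of caret, and the text output should then report any winnowed attributes. A minimal sketch, using iris as a stand-in for my data:

library(C50)

## Single C5.0 tree with the winnowing pre-pass enabled;
## C5.0Control(winnow = TRUE) is the switch described in the quote above.
fit <- C5.0(x = iris[, 1:4], y = iris$Species,
            control = C5.0Control(winnow = TRUE))
summary(fit)  # header lists attributes removed by winnowing

My actual tuning code:)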

library(caret)
library(C50)

c50Grid <- expand.grid(.trials = c(1:9, (1:10)*10),
                       .model = c("tree", "rules"),
                       .winnow = c(TRUE, FALSE))
c50Grid

## ctrl is a trainControl() object defined earlier,
## e.g. ctrl <- trainControl(method = "repeatedcv")
set.seed(1) # important to have reproducible results
c5Fitvac <- train(Class ~ .,
                  data = training,
                  method = "C5.0",
                  tuneGrid = c50Grid,
                  trControl = ctrl,
                  metric = "Accuracy", # the default, so not strictly needed
                  importance = TRUE,   # not needed
                  preProc = c("center", "scale"))
> c5Fitvac$finalModel$tuneValue
   trials model winnow
16     70  tree  FALSE  

CV tuning output:

[Plot of the cross-validation accuracy profiles across the tuning grid (trials, model type, winnowing)]
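(The tuning plot can be regenerated at any point from the train object itself; both plot() and ggplot() have methods for caret train objects:

plot(c5Fitvac)    # lattice accuracy profiles over the tuning grid
library(ggplot2)
ggplot(c5Fitvac)  # same information, ggplot2 version
)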

Excerpt from the C5.0 tree output:

> c5Fitvac$finalModel$tree
[1] "id=\"See5/C5.0 2.07 GPL Edition 2014-01-22\"\nentries=\"70\"\ntype=\"2\" class=\"Q\" freq=\"9,16,60,80\" att=\"IL17A\" forks=\"3\" cut=\"0.92485309\"\ntype=\"0\" class=\"Q\"\ntype=\"2\" class=\"Q\" freq=\"0,4,59,80\" att=\"IL23R\" forks=\"3\" cut=\"0.26331303\"\ntype=\"0\" class=\"Q\"\ntype=\"2\" class=\"Q\" freq=\"0,4,19,80\" att=\"IL12RB2\" forks=\"3\" cut=\"0.41611555\"\ntype=\"0\" class=\"Q\"\ntype=\"2\" class=\"Q\" freq=\"0,4,9,80\" att=\"IL23R\" forks=\   

And the predictors used in the final model (note that predictors() lists the variables actually used, not their importance):

> predictors(c5Fitvac)
 [1] "IL23R"   "IL12RB2" "IL8"     "IL23A"   "IL6ST"   "IL12A"   "IL12RB1"
 [8] "IL27RA"  "IL12B"   "IL17A"   "EBI3"

Questions:

  1. Why are the accuracy levels for no winnowing in the plot about twice those for winnowing? Can you please help me interpret this output when it says winnow = FALSE?
  2. How can I visualize the tree output instead of the garbled text that appeared in my case? Is there any way to see an actual tree instead of crowded symbols?

Best Answer

Thanks for the plug =]

1) The winnowing process is erroneously removing predictors that could improve the accuracy of the model. Within the cross-validation loop, the winnowing process thinks that it is improving the accuracy, but that does not hold up once other samples are used to evaluate performance. Sometimes it helps and other times it doesn't.
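One quick way to see this in the numbers is to average the resampled accuracy per winnowing setting straight from the train object (a sketch, assuming the c5Fitvac object from the question; caret stores the resampled metrics in the results data frame):

## mean resampled Accuracy for each winnow setting,
## averaged over trials and model type
aggregate(Accuracy ~ winnow, data = c5Fitvac$results, FUN = mean)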

2) There is no graph of the tree yet (but it is on my list). Try using the summary function:

> set.seed(1)
> mod <- train(Species ~ ., data = iris, method = "C5.0")
> ## This data set liked rules over trees but it works the same for trees
> summary(mod$finalModel)

Call:
<snip>
-----  Trial 0:  -----

Rules:

Rule 0/1: (50, lift 2.9)
        Petal.Length <= 1.9
        ->  class setosa  [0.981]

Rule 0/2: (48/1, lift 2.9)
    Petal.Length > 1.9
    Petal.Length <= 4.9
    Petal.Width <= 1.7
    ->  class versicolor  [0.960]
<snip>
Evaluation on training data (150 cases):

Trial           Rules     
-----     ----------------
      No           Errors

   0         4    4( 2.7%)
   1         5    8( 5.3%)
   2         3    6( 4.0%)
   3         6   12( 8.0%)
   4         4    5( 3.3%)
   5         7    3( 2.0%)
   6         3    8( 5.3%)
   7         8   15(10.0%)
   8         4    3( 2.0%)
   9         5    5( 3.3%)
boost             0( 0.0%)   <<


   (a)   (b)   (c)    <-classified as
  ----  ----  ----
    50                (a): class setosa
          50          (b): class versicolor
                50    (c): class virginica


Attribute usage:

100.00% Petal.Length
 66.67% Petal.Width
 54.00% Sepal.Width
 46.67% Sepal.Length


Time: 0.0 secs
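As an aside, the "Attribute usage" table above is what drives variable importance for C5.0; C5imp() in the C50 package returns it as a data frame (and caret's varImp() uses it under the hood):

> C5imp(mod$finalModel, metric = "usage")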

HTH,

Max