The following code trains decision trees of varying complexity on synthetic data:
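library(caret)
# Simulate a two-class data set with 10 linear and 10 noise predictors
d <- twoClassSim(10000, intercept = -10, linearVars = 10, noiseVars = 10)
# 10-fold CV reporting ROC AUC, sensitivity and specificity
tc <- trainControl(method = "cv", summaryFunction = twoClassSummary, classProbs = TRUE, allowParallel = FALSE)
# Tune cp over 2^-1, ..., 2^-24 and 0
train(Class ~ ., data = d, method = "rpart", trControl = tc, tuneGrid = expand.grid(cp = c(2^-(1:24), 0)), metric = "ROC")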
These are the results:
CART
10000 samples
25 predictor
2 classes: 'Class1', 'Class2'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 9001, 9001, 9000, 9000, 9000, 9000, ...
Resampling results across tuning parameters:
cp ROC Sens Spec
0.000000e+00 0.8720221 0.9175038 0.6351468
5.960464e-08 0.8693352 0.9178879 0.6338036
1.192093e-07 0.8693352 0.9178879 0.6338036
2.384186e-07 0.8693352 0.9178879 0.6338036
4.768372e-07 0.8693352 0.9178879 0.6338036
9.536743e-07 0.8693352 0.9178879 0.6338036
1.907349e-06 0.8693352 0.9178879 0.6338036
3.814697e-06 0.8693352 0.9178879 0.6338036
7.629395e-06 0.8693352 0.9178879 0.6338036
1.525879e-05 0.8693352 0.9178879 0.6338036
3.051758e-05 0.8693352 0.9178879 0.6338036
6.103516e-05 0.8693352 0.9178879 0.6338036
1.220703e-04 0.8688977 0.9184034 0.6338036
2.441406e-04 0.8695238 0.9190479 0.6333571
4.882812e-04 0.8683167 0.9199503 0.6346964
9.765625e-04 0.8642201 0.9234327 0.6275635
1.953125e-03 0.8502711 0.9358066 0.6061528
3.906250e-03 0.8170988 0.9421235 0.5776111
7.812500e-03 0.7992001 0.9391563 0.5624742
1.562500e-02 0.7309271 0.9416099 0.4928790
3.125000e-02 0.7279783 0.9249799 0.5267897
6.250000e-02 0.7279783 0.9249799 0.5267897
1.250000e-01 0.6607248 0.9497346 0.3688948
2.500000e-01 0.5000000 1.0000000 0.0000000
5.000000e-01 0.5000000 1.0000000 0.0000000
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.0002441406.
Clearly, the decision tree has its highest ROC AUC with complexity parameter 0, which suggests that it does not overfit at all. How can that be explained? Is this plausible?
Best Answer
In the rpart package, in addition to cp, the parameters minsplit, minbucket and maxdepth also have default values, and these keep the tree from growing until it fits every instance. Try setting minsplit=1 and minbucket=1.
A related discussion can be found here:
Why I cannot achieve 100% accuracy in my simple training data with CART model?
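To see the effect of these defaults in isolation, here is a minimal sketch that uses rpart directly rather than caret. The toy data frame and the helper names n_leaves and train_acc are made up for illustration and are not part of the question. It fits two trees with cp = 0, one with rpart's default size limits and one with the limits relaxed as suggested above, then compares leaf counts and training accuracy.

```r
library(rpart)

set.seed(1)

# Hypothetical toy data: a noisy binary label that depends on x1 only
n <- 2000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + rnorm(n) > 0, "Class1", "Class2"))

# cp = 0, but rpart's default minsplit = 20 and minbucket = 7 still apply
fit_default <- rpart(y ~ ., data = toy, method = "class",
                     control = rpart.control(cp = 0, xval = 0))

# cp = 0 with the node-size limits relaxed (maxdepth keeps its default of 30)
fit_relaxed <- rpart(y ~ ., data = toy, method = "class",
                     control = rpart.control(cp = 0, xval = 0,
                                             minsplit = 1, minbucket = 1))

# Illustrative helpers: number of terminal nodes and training accuracy
n_leaves  <- function(fit) sum(fit$frame$var == "<leaf>")
train_acc <- function(fit) mean(predict(fit, toy, type = "class") == toy$y)

print(c(default = n_leaves(fit_default),  relaxed = n_leaves(fit_relaxed)))
print(c(default = train_acc(fit_default), relaxed = train_acc(fit_relaxed)))
```

The relaxed tree should end up with considerably more leaves and a higher training accuracy, which is the overfitting that the default limits hold back even at cp = 0. In caret, the same control object can, as far as I know, be passed through the extra arguments of train, for example train(..., method = "rpart", control = rpart.control(minsplit = 1, minbucket = 1)); treat that call as an assumption to verify against your caret version.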