You can specify method="none" in trainControl. For example:
library(caret)
train(Species ~ ., data = iris, method = "rf", tuneGrid = data.frame(mtry = 3),
      trControl = trainControl(method = "none"))
I'm not sure when this was implemented.
The summary printed for the model contains the line
6 0.76 0.68 0.0507 0.068
which tells you that the expected/average accuracy for a properly cross-validated experiment (training kept separate from testing) should be about 0.76.
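If it helps, the same numbers are also stored in the fitted object itself. A minimal sketch, assuming model is the object returned by train() and that it was fit with an actual resampling method (not method="none"):
# resampled performance for each candidate tuning value:
# columns are mtry, Accuracy, Kappa, AccuracySD, KappaSD
model$results
model$results[model$results$mtry == 6, c("Accuracy", "Kappa")]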
I have never used the line
model$pred[model$pred$mtry == 6, c("pred", "obs")]
before, but I guess it is giving you the aggregated held-out results of all the internal cross-validations done when testing mtry = 6. You get 0.7893916, which is pretty close to 0.76.
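For reference, that number can be reproduced from the held-out predictions along these lines (a sketch; it assumes the model was trained with savePredictions = TRUE in trainControl, otherwise model$pred is not populated):
# proportion of held-out predictions that match the observed class for mtry == 6
held_out <- model$pred[model$pred$mtry == 6, c("pred", "obs")]
mean(held_out$pred == held_out$obs)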
Caret, by default, also fits a final model on all of the training data provided, and that is the model used in the line
pred=predict(model, data_pred_scale),
so what is curious is that the random forest gets 100% accuracy when tested on the data used to train it. It is not impossible, of course, but it is curious.
This phenomenon is not technically overfitting; it goes beyond that. I do not know of any good reason to test a classifier on the data used to train it.
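If you want an honest performance estimate without relying only on the internal resampling, a minimal sketch of a proper hold-out evaluation (using the iris data from the example above; the seed and 80/20 split are arbitrary choices):
library(caret)
set.seed(42)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[idx, ]
testing  <- iris[-idx, ]
fit <- train(Species ~ ., data = training, method = "rf")
# evaluate only on data the model has never seen
confusionMatrix(predict(fit, newdata = testing), testing$Species)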
Best Answer
Generally speaking, the Area Under the ROC curve (AUROC) statistic is used when you have imbalanced classes, for example 5% 1's and 95% 0's.
In practice, we are more interested in the AUROC to judge how well the model rank-orders cases (i.e., ranks them from high probability to low probability of being a 1), whereas Accuracy is... well, you already know that.
In the context of model tuning, my advice would be to use AUC (especially if you have imbalanced classes) instead of Accuracy.
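In caret this amounts to computing class probabilities and switching the summary function, roughly like this (a sketch for a two-class problem; my_two_class_data and Class are placeholder names, and the outcome must be a two-level factor whose levels are valid R names):
library(caret)
# twoClassSummary reports ROC, Sens and Spec instead of Accuracy/Kappa
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
fit <- train(Class ~ ., data = my_two_class_data, method = "rf",
             metric = "ROC", trControl = ctrl)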