Solved – Caret error: train()’s use of ROC codes requires class probabilities. See the classProbs option of trainControl() when using Logistic Regression

caret, classification, logistic, r

I'm trying to fit a weighted binomial logistic regression model (using maximum likelihood estimation) to credit card transaction data, using the caret package.

This is what my dataset looks like:

         V1          V2        V3         V4          V5          V6          V7          V8
1 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778  0.23959855  0.09869790
2  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081 -0.07880298  0.08510165
3 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938  0.79146096  0.24767579
4 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317  0.23760894  0.37743587
5 -1.1582331  0.87773675 1.5487178  0.4030339 -0.40719338  0.09592146  0.59294075 -0.27053268
          V9         V10        V11         V12        V13        V14        V15        V16
1  0.3637870  0.09079417 -0.5515995 -0.61780086 -0.9913898 -0.3111694  1.4681770 -0.4704005
2 -0.2554251 -0.16697441  1.6127267  1.06523531  0.4890950 -0.1437723  0.6355581  0.4639170
3 -1.5146543  0.20764287  0.6245015  0.06608369  0.7172927 -0.1659459  2.3458649 -2.8900832
4 -1.3870241 -0.05495192 -0.2264873  0.17822823  0.5077569 -0.2879237 -0.6314181 -1.0596472
5  0.8177393  0.75307443 -0.8228429  0.53819555  1.3458516 -1.1196698  0.1751211 -0.4514492
         V17         V18        V19         V20          V21          V22        V23
1  0.2079712  0.02579058  0.4039930  0.25141210 -0.018306778  0.277837576 -0.1104739
2 -0.1148047 -0.18336127 -0.1457830 -0.06908314 -0.225775248 -0.638671953  0.1012880
3  1.1099694 -0.12135931 -2.2618571  0.52497973  0.247998153  0.771679402  0.9094123
4 -0.6840928  1.96577500 -1.2326220 -0.20803778 -0.108300452  0.005273597 -0.1903205
5 -0.2370332 -0.03819479  0.8034869  0.40854236 -0.009430697  0.798278495 -0.1374581
          V24        V25        V26          V27         V28      Amount Class
1  0.06692807  0.1285394 -0.1891148  0.133558377 -0.02105305  0.24496383     0
2 -0.33984648  0.1671704  0.1258945 -0.008983099  0.01472417 -0.34247394     0
3 -0.68928096 -0.3276418 -0.1390966 -0.055352794 -0.05975184  1.16068389     0
4 -1.17557533  0.6473760 -0.2219288  0.062722849  0.06145763  0.14053401     0
5  0.14126698 -0.2060096  0.5022922  0.219422230  0.21515315 -0.07340321     0

The Class column can contain values of 1 or 0.

Here is where I try to fit the model:

fitWeightedLogitModel = function(trainSet) {
  model_weights <- ifelse(trainSet$Class == 1, 
                         (1/table(trainSet$Class)[1]) * 0.5, 
                         (1/table(trainSet$Class)[2]) * 0.5)

  # model <- glm(Class ~ ., data = trainSet, weights = model_weights,
  #              family = binomial())

  weighted_model <- train(Class ~ ., 
                          data = trainSet,
                          method = "glm",
                          family = binomial(),
                          verbose = FALSE, 
                          weights = model_weights, 
                          metric = "ROC", 
                          trControl = trainControl(
                            classProbs = TRUE,
                            method = "cv",
                            number = 10,
                            summaryFunction = twoClassSummary))
  return(weighted_model)
}

When I execute that method I get this error:

Error in evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,  : 
  train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
In addition: Warning messages:
1: In train.default(x, y, weights = w, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
2: In train.default(x, y, weights = w, ...) :

When I view the stacktrace of the classProbs error:

8. stop("train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()")
7. evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,
       metric = metric, method = method)
6. train.default(x, y, weights = w, ...)
5. train(x, y, weights = w, ...)
4. train.formula(Class ~ ., data = trainSet, method = "glm", family = binomial(),
       verbose = FALSE, weights = model_weights, metric = "ROC",
       trControl = trainControl(classProbs = TRUE, method = "cv",
           number = 10, summaryFunction = twoClassSummary))
3. train(Class ~ ., data = trainSet, method = "glm", family = binomial(),
       verbose = FALSE, weights = model_weights, metric = "ROC",
       trControl = trainControl(classProbs = TRUE, method = "cv",
           number = 10, summaryFunction = twoClassSummary))
2. fitWeightedLogitModel(train)
1. executeWeightedLogit()

How can I get past the classProbs error, given that I already have classProbs = TRUE set in trainControl()?

Also, how would I get past the warning "You are trying to do regression and your outcome only has two possible values. Are you trying to do classification? If so, use a 2 level factor as your outcome column."? I thought that method = "glm" with family = binomial() would tell R that I want logistic regression and that the response can take only two values.

Best Answer

I can reproduce your error with simulated data.

library(caret)

data <- twoClassSim()
data$numericClass <- ifelse(data$Class == "Class2", 1, 0)
data$factorClass <- data$Class

data$Class <- data$numericClass

# don't forget to remove extra class variables! 
fitWeightedLogitModel(data[,-(17:18)]) # generates warning and error

data$Class <- data$factorClass

fitWeightedLogitModel(data[,-(17:18)]) # generates a different error!

So the first part of the answer is: train() expects a factor outcome when it is doing classification. glm(..., family = binomial()) is cool with a 0/1 numeric response, but train() isn't. So coerce Class to a factor by adding:

trainSet$Class <- factor(trainSet$Class)
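One caveat with that coercion: when classProbs = TRUE, caret also requires the factor levels to be valid R variable names, and levels named "0" and "1" are not, so a bare factor() call can trigger a follow-on error about invalid class levels. Relabeling the levels during the conversion avoids that (the "legit"/"fraud" labels below are just an illustrative choice, and the toy data frame stands in for the real trainSet):

```r
# Toy stand-in for the real trainSet (illustrative only)
trainSet <- data.frame(V1 = rnorm(6), Class = c(0, 1, 0, 0, 1, 0))

# Relabel 0/1 so the factor levels are valid R variable names,
# which caret requires when classProbs = TRUE
trainSet$Class <- factor(trainSet$Class,
                         levels = c(0, 1),
                         labels = c("legit", "fraud"))

levels(trainSet$Class)  # "legit" "fraud"
```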

There is another error that arises, likely because glm() itself has no verbose argument and train() passes unrecognized arguments through to the underlying fitting function. You can fix that by removing the line

verbose = FALSE,

from the call to train().

That now seems to run OK. I get warnings from glm() about a non-integer number of successes (expected when a binomial fit is given fractional weights), but otherwise it generates sensible results.
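Putting the fixes together, a corrected version of the function might look like the sketch below. The "legit"/"fraud" level labels are an illustrative assumption; also note that the weights here pair each class with its own count, whereas the original ifelse appears to index table() the other way around (table(trainSet$Class)[1] is the count of class 0, not class 1):

```r
library(caret)

fitWeightedLogitModel <- function(trainSet) {
  # train() needs a two-level factor outcome whose levels
  # are valid R variable names (not "0"/"1")
  trainSet$Class <- factor(trainSet$Class,
                           levels = c(0, 1),
                           labels = c("legit", "fraud"))

  # Inverse-frequency weights: each class contributes
  # half of the total weight, so the sum of weights is 1
  model_weights <- ifelse(trainSet$Class == "fraud",
                          0.5 / sum(trainSet$Class == "fraud"),
                          0.5 / sum(trainSet$Class == "legit"))

  # No verbose argument -- glm() would reject it
  train(Class ~ .,
        data = trainSet,
        method = "glm",
        family = binomial(),
        weights = model_weights,
        metric = "ROC",
        trControl = trainControl(method = "cv",
                                 number = 10,
                                 classProbs = TRUE,
                                 summaryFunction = twoClassSummary))
}
```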
