Solved – Caret error: train()’s use of ROC codes requires class probabilities. See the classProbs option of trainControl() when using Logistic Regression

caret, classification, logistic, r

I'm trying to fit a weighted binomial logistic regression model (using maximum likelihood estimation) to credit card transaction data, using the caret package.

This is what my dataset looks like:

         V1          V2        V3         V4          V5          V6          V7          V8
1 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778  0.23959855  0.09869790
2  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081 -0.07880298  0.08510165
3 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938  0.79146096  0.24767579
4 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317  0.23760894  0.37743587
5 -1.1582331  0.87773675 1.5487178  0.4030339 -0.40719338  0.09592146  0.59294075 -0.27053268
          V9         V10        V11         V12        V13        V14        V15        V16
1  0.3637870  0.09079417 -0.5515995 -0.61780086 -0.9913898 -0.3111694  1.4681770 -0.4704005
2 -0.2554251 -0.16697441  1.6127267  1.06523531  0.4890950 -0.1437723  0.6355581  0.4639170
3 -1.5146543  0.20764287  0.6245015  0.06608369  0.7172927 -0.1659459  2.3458649 -2.8900832
4 -1.3870241 -0.05495192 -0.2264873  0.17822823  0.5077569 -0.2879237 -0.6314181 -1.0596472
5  0.8177393  0.75307443 -0.8228429  0.53819555  1.3458516 -1.1196698  0.1751211 -0.4514492
         V17         V18        V19         V20          V21          V22        V23
1  0.2079712  0.02579058  0.4039930  0.25141210 -0.018306778  0.277837576 -0.1104739
2 -0.1148047 -0.18336127 -0.1457830 -0.06908314 -0.225775248 -0.638671953  0.1012880
3  1.1099694 -0.12135931 -2.2618571  0.52497973  0.247998153  0.771679402  0.9094123
4 -0.6840928  1.96577500 -1.2326220 -0.20803778 -0.108300452  0.005273597 -0.1903205
5 -0.2370332 -0.03819479  0.8034869  0.40854236 -0.009430697  0.798278495 -0.1374581
          V24        V25        V26          V27         V28      Amount Class
1  0.06692807  0.1285394 -0.1891148  0.133558377 -0.02105305  0.24496383     0
2 -0.33984648  0.1671704  0.1258945 -0.008983099  0.01472417 -0.34247394     0
3 -0.68928096 -0.3276418 -0.1390966 -0.055352794 -0.05975184  1.16068389     0
4 -1.17557533  0.6473760 -0.2219288  0.062722849  0.06145763  0.14053401     0
5  0.14126698 -0.2060096  0.5022922  0.219422230  0.21515315 -0.07340321     0

The Class column can contain values of 1 or 0.

Here is where I try to fit the model:

fitWeightedLogitModel = function(trainSet) {
  model_weights <- ifelse(trainSet$Class == 1, 
                         (1/table(trainSet$Class)[1]) * 0.5, 
                         (1/table(trainSet$Class)[2]) * 0.5)

  # model <- glm(Class ~ ., data = trainSet, weights = model_weights,
  #              family = binomial())

  weighted_model <- train(Class ~ ., 
                          data = trainSet,
                          method = "glm",
                          family = binomial(),
                          verbose = FALSE, 
                          weights = model_weights, 
                          metric = "ROC", 
                          trControl = trainControl(
                            classProbs = TRUE,
                            method = "cv",
                            number = 10,
                            summaryFunction = twoClassSummary))
  return(weighted_model)
}

When I execute that method I get this error:

Error in evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,  : 
  train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
In addition: Warning messages:
1: In train.default(x, y, weights = w, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
2: In train.default(x, y, weights = w, ...) :

When I view the stacktrace of the classProbs error:

8. stop("train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()")
7. evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,
       metric = metric, method = method)
6. train.default(x, y, weights = w, ...)
5. train(x, y, weights = w, ...)
4. train.formula(Class ~ ., data = trainSet, method = "glm", family = binomial(),
       verbose = FALSE, weights = model_weights, metric = "ROC",
       trControl = trainControl(classProbs = TRUE, method = "cv",
           number = 10, summaryFunction = twoClassSummary))
3. train(Class ~ ., data = trainSet, method = "glm", family = binomial(),
       verbose = FALSE, weights = model_weights, metric = "ROC",
       trControl = trainControl(classProbs = TRUE, method = "cv",
           number = 10, summaryFunction = twoClassSummary))
2. fitWeightedLogitModel(train)
1. executeWeightedLogit()

How can I get past the classProbs error, given that I already have classProbs = TRUE set in trainControl()?

Also, how would I get past the warning "You are trying to do regression and your outcome only has two possible values. Are you trying to do classification? If so, use a 2 level factor as your outcome column."? I thought that method = "glm" with family = binomial() would tell R that I want logistic regression and that the response can take only two values.

Best Answer

I can reproduce your error with simulated data.

library(caret)

data <- twoClassSim()
data$numericClass <- ifelse(data$Class == "Class2", 1, 0)
data$factorClass <- data$Class

data$Class <- data$numericClass

# don't forget to remove extra class variables! 
fitWeightedLogitModel(data[,-(17:18)]) # generates warning and error

data$Class <- data$factorClass

fitWeightedLogitModel(data[,-(17:18)]) # generates a different error!

So the first part of the answer is: train() expects a factor outcome when it is doing classification. glm(..., family = binomial()) is cool with a 0/1 numeric response, but train() isn't. So coerce Class to a factor by adding:

trainSet$Class <- factor(trainSet$Class)
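One caveat with that coercion: when classProbs = TRUE, caret also requires the factor levels to be valid R variable names, and levels named "0" and "1" are not, so a bare factor() call can trigger a follow-on error about invalid class levels. Relabeling the levels during the conversion avoids that (the "legit"/"fraud" labels below are just an illustrative choice, and the toy data frame stands in for the real trainSet):

```r
# Toy stand-in for the real trainSet (illustrative only)
trainSet <- data.frame(V1 = rnorm(6), Class = c(0, 1, 0, 0, 1, 0))

# Relabel 0/1 so the factor levels are valid R variable names,
# which caret requires when classProbs = TRUE
trainSet$Class <- factor(trainSet$Class,
                         levels = c(0, 1),
                         labels = c("legit", "fraud"))

levels(trainSet$Class)  # "legit" "fraud"
```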

There is another error that arises, likely because glm() itself has no verbose argument and train() passes unrecognized arguments through to the underlying fitting function. You can fix that by removing the line

verbose = FALSE,

from the call to train().

That now seems to run OK. I get warnings from glm() about a non-integer number of successes (expected when a binomial fit is given fractional weights), but otherwise it generates sensible results.
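Putting the fixes together, a corrected version of the function might look like the sketch below. The "legit"/"fraud" level labels are an illustrative assumption; also note that the weights here pair each class with its own count, whereas the original ifelse appears to index table() the other way around (table(trainSet$Class)[1] is the count of class 0, not class 1):

```r
library(caret)

fitWeightedLogitModel <- function(trainSet) {
  # train() needs a two-level factor outcome whose levels
  # are valid R variable names (not "0"/"1")
  trainSet$Class <- factor(trainSet$Class,
                           levels = c(0, 1),
                           labels = c("legit", "fraud"))

  # Inverse-frequency weights: each class contributes
  # half of the total weight, so the sum of weights is 1
  model_weights <- ifelse(trainSet$Class == "fraud",
                          0.5 / sum(trainSet$Class == "fraud"),
                          0.5 / sum(trainSet$Class == "legit"))

  # No verbose argument -- glm() would reject it
  train(Class ~ .,
        data = trainSet,
        method = "glm",
        family = binomial(),
        weights = model_weights,
        metric = "ROC",
        trControl = trainControl(method = "cv",
                                 number = 10,
                                 classProbs = TRUE,
                                 summaryFunction = twoClassSummary))
}
```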
