I'm trying to fit a weighted binomial logistic regression model (estimated by maximum likelihood) to credit card transaction data, using the caret package.
This is what my dataset looks like.
V1 V2 V3 V4 V5 V6 V7 V8
1 -1.3598071 -0.07278117 2.5363467 1.3781552 -0.33832077 0.46238778 0.23959855 0.09869790
2 1.1918571 0.26615071 0.1664801 0.4481541 0.06001765 -0.08236081 -0.07880298 0.08510165
3 -1.3583541 -1.34016307 1.7732093 0.3797796 -0.50319813 1.80049938 0.79146096 0.24767579
4 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888 1.24720317 0.23760894 0.37743587
5 -1.1582331 0.87773675 1.5487178 0.4030339 -0.40719338 0.09592146 0.59294075 -0.27053268
V9 V10 V11 V12 V13 V14 V15 V16
1 0.3637870 0.09079417 -0.5515995 -0.61780086 -0.9913898 -0.3111694 1.4681770 -0.4704005
2 -0.2554251 -0.16697441 1.6127267 1.06523531 0.4890950 -0.1437723 0.6355581 0.4639170
3 -1.5146543 0.20764287 0.6245015 0.06608369 0.7172927 -0.1659459 2.3458649 -2.8900832
4 -1.3870241 -0.05495192 -0.2264873 0.17822823 0.5077569 -0.2879237 -0.6314181 -1.0596472
5 0.8177393 0.75307443 -0.8228429 0.53819555 1.3458516 -1.1196698 0.1751211 -0.4514492
V17 V18 V19 V20 V21 V22 V23
1 0.2079712 0.02579058 0.4039930 0.25141210 -0.018306778 0.277837576 -0.1104739
2 -0.1148047 -0.18336127 -0.1457830 -0.06908314 -0.225775248 -0.638671953 0.1012880
3 1.1099694 -0.12135931 -2.2618571 0.52497973 0.247998153 0.771679402 0.9094123
4 -0.6840928 1.96577500 -1.2326220 -0.20803778 -0.108300452 0.005273597 -0.1903205
5 -0.2370332 -0.03819479 0.8034869 0.40854236 -0.009430697 0.798278495 -0.1374581
V24 V25 V26 V27 V28 Amount Class
1 0.06692807 0.1285394 -0.1891148 0.133558377 -0.02105305 0.24496383 0
2 -0.33984648 0.1671704 0.1258945 -0.008983099 0.01472417 -0.34247394 0
3 -0.68928096 -0.3276418 -0.1390966 -0.055352794 -0.05975184 1.16068389 0
4 -1.17557533 0.6473760 -0.2219288 0.062722849 0.06145763 0.14053401 0
5 0.14126698 -0.2060096 0.5022922 0.219422230 0.21515315 -0.07340321 0
The Class column can contain values of 1 or 0.
Here is where I try to fit the model:
fitWeightedLogitModel = function(trainSet) {
  model_weights <- ifelse(trainSet$Class == 1,
                          (1/table(trainSet$Class)[1]) * 0.5,
                          (1/table(trainSet$Class)[2]) * 0.5)
  #model <- glm(Class ~ ., data = trainSet, weights = wt,
  #             family = binomial())
  weighted_model <- train(Class ~ .,
                          data = trainSet,
                          method = "glm",
                          family = binomial(),
                          verbose = FALSE,
                          weights = model_weights,
                          metric = "ROC",
                          trControl = trainControl(
                            classProbs = TRUE,
                            method = "cv",
                            number = 10,
                            summaryFunction = twoClassSummary))
  return(weighted_model)
}
When I execute that method I get this error:
Error in evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels, :
train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
In addition: Warning messages:
1: In train.default(x, y, weights = w, ...) :
You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
2: In train.default(x, y, weights = w, ...) :
When I view the stacktrace of the classProbs error:
8.
stop("train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()")
7.
evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,
metric = metric, method = method)
6.
train.default(x, y, weights = w, ...)
5.
train(x, y, weights = w, ...)
4.
train.formula(Class ~ ., data = trainSet, method = "glm", family = binomial(),
verbose = FALSE, weights = model_weights, metric = "ROC",
trControl = trainControl(classProbs = TRUE, method = "cv",
number = 10, summaryFunction = twoClassSummary))
3.
train(Class ~ ., data = trainSet, method = "glm", family = binomial(),
verbose = FALSE, weights = model_weights, metric = "ROC",
trControl = trainControl(classProbs = TRUE, method = "cv",
number = 10, summaryFunction = twoClassSummary))
2.
fitWeightedLogitModel(train)
1.
executeWeightedLogit()
How can I get past the classProbs error, given that I do have it set to TRUE in trainControl()?
Also, how would I get past the warning "You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column."? I thought that method = "glm" with family = "binomial" would tell R that I want to use logistic regression and that the response can only take two values.
Best Answer
I can reproduce your error with simulated data.
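The first part of the answer is that train() expects a factor outcome when it is doing classification. glm(..., family = "binomial") is happy with a 0/1 outcome, but train() is not: given a numeric Class it treats the problem as regression, which is presumably why your classProbs = TRUE setting is not honored. So coerce Class to a factor before calling train().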
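As a minimal sketch of that coercion: the tiny trainSet below is a made-up stand-in for the real data, and the labels "legit" and "fraud" are my own illustrative choice, not something from the question. I relabel the levels rather than using as.factor() alone because, with classProbs = TRUE, caret also wants the level names to be valid R variable names, and "0" and "1" are not.

```r
# Hypothetical stand-in for the asker's trainSet: a numeric 0/1 Class column.
trainSet <- data.frame(V1 = c(-1.36, 1.19, -1.36, -0.97),
                       Class = c(0, 1, 0, 0))

# Convert the outcome to a two-level factor. The labels are illustrative;
# any syntactically valid R names would do.
trainSet$Class <- factor(trainSet$Class,
                         levels = c(0, 1),
                         labels = c("legit", "fraud"))

levels(trainSet$Class)
```

With the first level being the 0 class, glm() still models the probability of the second level, so the coefficients keep the same interpretation as with the 0/1 coding.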
There is another error that arises, and you can fix that by removing the line
from the call to train().
That now seems to run OK. I get warnings about a non-integer number of successes from glm(), which come from the fractional weights, but otherwise the results look reasonable.
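For reference, here is a rough end-to-end sketch on simulated data. Everything about the data (the columns V1 and V2, the sample size, the class labels) is made up for illustration. The text above no longer shows which line was removed from the train() call; this sketch simply leaves out verbose = FALSE, on the assumption that it is the culprit, since glm() has no verbose argument. The weights are indexed by class name rather than by position in table(), so each class's weights sum to 0.5.

```r
library(caret)

set.seed(42)

# Simulated stand-in for the credit card data (hypothetical columns).
n <- 500
sim <- data.frame(V1 = rnorm(n), V2 = rnorm(n))
sim$Class <- rbinom(n, size = 1, prob = plogis(-2 + 1.5 * sim$V1))

# Outcome as a two-level factor with syntactically valid level names.
sim$Class <- factor(sim$Class, levels = c(0, 1), labels = c("legit", "fraud"))

# Inverse-frequency weights: each class's weights sum to 0.5.
class_counts <- table(sim$Class)
model_weights <- 0.5 / as.numeric(class_counts[as.character(sim$Class)])

# Same call as in the question, minus verbose = FALSE (an assumption, see above).
weighted_model <- train(Class ~ .,
                        data = sim,
                        method = "glm",
                        family = binomial(),
                        weights = model_weights,
                        metric = "ROC",
                        trControl = trainControl(method = "cv",
                                                 number = 10,
                                                 classProbs = TRUE,
                                                 summaryFunction = twoClassSummary))

# Cross-validated ROC, sensitivity and specificity.
weighted_model$results
```

Expect the "non-integer #successes in a binomial glm!" warning on each fit. If it bothers you, family = quasibinomial() should give the same coefficient estimates without that warning, though I have not checked how it interacts with the rest of your pipeline.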