Solved – Should logistic regression models generated with and without cross-validation by caret's train function in R be the same?

Tags: caret, logistic, r

I am working with the Titanic dataset and trying to use logistic regression in R to predict survival. The simple approach I tried was to just use the glm function with binomial family and logit link specified:

f <- as.formula("Survived_char ~ Pclass_char + Sex + mAgeD + SibSp + Title + FsizeD")
logit <- glm(f, family = binomial(link = "logit"), data = new_train)

The next approach was to use the train function in the caret package with cross validation:

tc <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                   classProbs = TRUE, summaryFunction = twoClassSummary)
fit <- train(form = f, data = new_train, method = "glm",
             tuneLength = 5, trControl = tc, metric = "ROC",
             family = binomial(link = "logit"))

However, the two models have the same coefficients. Is that correct? I thought that k-fold cross-validation would yield a model whose coefficients were averages of the k models developed. If the same model is generated with and without cross-validation, what is the advantage of developing a model with the train function rather than using glm directly?

Best Answer

However, the two models have the same coefficients. Is that correct?

Yes, that is correct. Logistic regression has no hyperparameters to tune over; the coefficient estimates are always given by maximum likelihood. Repeated k-fold cross-validation therefore does nothing to change the estimates of the parameters/coefficients. The k models fit during cross-validation are used only to measure out-of-sample performance and are then discarded; caret refits the model on the full training set to produce the final model, so no averaging of coefficients ever takes place.
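You can check this directly; a minimal sketch, using the logit and fit objects from your question (it should print TRUE, up to numerical tolerance):

# fit$finalModel is the glm object that caret refits on the full
# training set, so its coefficients match the plain glm fit
all.equal(coef(logit), coef(fit$finalModel))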

What is the advantage of developing a model with the train function rather than using glm directly?

In terms of getting model estimates, none. It will give you the same results as glm. However, you can view the cross-validation results to get some idea of how the model might perform out of sample (on your test set) by looking at:

  1. fit$resample, which gives the performance statistics for each resample, so 50 (= folds × repeats = 10 × 5) rows of statistics. With the twoClassSummary you specified, these are ROC, sensitivity and specificity; with caret's default summary function they would be accuracy and kappa instead. Note that accuracy can be pretty misleading as a proxy for out-of-sample performance if you have unbalanced data.
  2. fit$results, which summarises the final results of your cross-validation (the averages taken across your resamples), together with their standard deviations; see the short sketch after this list.
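A minimal sketch of inspecting both, again using the fit object from your question:

# one row per resample: 10 folds x 5 repeats = 50 rows;
# with twoClassSummary the columns are ROC, Sens and Spec
head(fit$resample)

# means and standard deviations of those statistics across all 50 resamples
fit$results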

train is much more useful when the model you are fitting has hyperparameters that must be chosen in order to get an estimate. Let's use lasso regression as an example. It is essentially OLS with a penalty on the coefficients to guard against overfitting, and we have to choose HOW MUCH penalty to apply. The lasso objective is shown below: ordinary OLS just minimises the residual sum of squares (the first term), while the lasso adds a penalty on the absolute values of the beta coefficients, scaled by lambda, in the second term. We can cross-validate with train over many different lambda values, and it will select the model with the lambda (penalty) that has the best cross-validation results; that model is what the trained object's finalModel component holds. I'll note here that for the logistic regression you've estimated, fit$finalModel isn't very meaningful, since there was only ever going to be one model, the same model given by glm.

$$\hat\beta^{\text{lasso}} = \underset{\beta}{\arg\min}\;\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$
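As a rough sketch of what that tuning looks like (the grid values here are illustrative, this assumes the glmnet package is installed, and f, new_train and tc are reused from your question):

library(caret)

# alpha = 1 makes glmnet a pure lasso; we tune only over lambda
lasso_grid <- expand.grid(alpha  = 1,
                          lambda = 10^seq(-4, 0, length.out = 20))

lasso_fit <- train(f, data = new_train, method = "glmnet",
                   family = "binomial",
                   trControl = tc, metric = "ROC",
                   tuneGrid = lasso_grid)

lasso_fit$bestTune    # the lambda value with the best cross-validated ROC
lasso_fit$finalModel  # the model refit on all of new_train at that lambda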

In summary, logistic regression has no hyperparameters; its estimates are found directly via maximum likelihood estimation. You still get cross-validation results, but they are for only one set of estimates, not for many different hyperparameter values (as there are none to choose from!).

For some background on the difference between hyperparameters and parameters see: https://datascience.stackexchange.com/questions/14187/what-is-the-difference-between-model-hyperparameters-and-model-parameters
