Solved – Should logistic regression models generated with and without cross-validation by caret's train function in R be the same?

Tags: caret, logistic, r

I am working with the Titanic dataset and trying to use logistic regression in R to predict survival. The simple approach I tried was to just use the glm function with binomial family and logit link specified:

f <- as.formula("Survived_char ~ Pclass_char + Sex + mAgeD + SibSp + Title + FsizeD")
logit <- glm(f, family = binomial(link = "logit"), data = new_train)

The next approach was to use the train function in the caret package with cross validation:

tc <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                   classProbs = TRUE, summaryFunction = twoClassSummary)
fit <- train(form = f, data = new_train, method = "glm",
             tuneLength = 5, trControl = tc, metric = "ROC",
             family = binomial(link = "logit"))

However, the two models have the same coefficients. Is that correct? I thought that k-fold cross-validation would yield a model whose coefficients were averages of the k models developed. If the same model is generated with and without cross-validation, what is the advantage of developing a model with the train function rather than using glm directly?

Best Answer

However, the two models have the same coefficients. Is that correct?

Yes, that is correct. Logistic regression has no hyperparameters to tune over; the coefficient estimates are always given by maximum likelihood. Repeated k-fold cross-validation therefore does nothing to change the estimates of the parameters/coefficients. The k models fit during cross-validation are used only to measure out-of-sample performance and are then discarded; caret refits the model on the full training set to produce the final model, so no averaging of coefficients ever takes place.
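You can check this directly; a minimal sketch, using the logit and fit objects from your question (it should print TRUE, up to numerical tolerance):

# fit$finalModel is the glm object that caret refits on the full
# training set, so its coefficients match the plain glm fit
all.equal(coef(logit), coef(fit$finalModel))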

What is the advantage of developing a model with the train function rather than using glm directly?

In terms of getting model estimates, none. It will give you the same results as glm. However, you can view the cross-validation results to get some idea of how the model might perform out of sample (on your test set) by looking at:

  1. fit$resample, which gives the performance statistics for each resample, so 50 (= folds × repeats = 10 × 5) rows of statistics. With the twoClassSummary you specified, these are ROC, sensitivity and specificity; with caret's default summary function they would be accuracy and kappa instead. Note that accuracy can be pretty misleading as a proxy for out-of-sample performance if you have unbalanced data.
  2. fit$results, which summarises the final results of your cross-validation (the averages taken across your resamples), together with their standard deviations; see the short sketch after this list.
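A minimal sketch of inspecting both, again using the fit object from your question:

# one row per resample: 10 folds x 5 repeats = 50 rows;
# with twoClassSummary the columns are ROC, Sens and Spec
head(fit$resample)

# means and standard deviations of those statistics across all 50 resamples
fit$results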

train is much more useful when the model you are fitting has hyperparameters that must be chosen in order to get an estimate. Let's use lasso regression as an example. It is essentially OLS with a penalty on the coefficients to guard against overfitting, and we have to choose HOW MUCH penalty to apply. The lasso objective is shown below: ordinary OLS just minimises the residual sum of squares (the first term), while the lasso adds a penalty on the absolute values of the beta coefficients, scaled by lambda, in the second term. We can cross-validate with train over many different lambda values, and it will select the model with the lambda (penalty) that has the best cross-validation results; that model is what the trained object's finalModel component holds. I'll note here that for the logistic regression you've estimated, fit$finalModel isn't very meaningful, since there was only ever going to be one model, the same model given by glm.

$$\hat\beta^{\text{lasso}} = \underset{\beta}{\arg\min}\;\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$
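As a rough sketch of what that tuning looks like (the grid values here are illustrative, this assumes the glmnet package is installed, and f, new_train and tc are reused from your question):

library(caret)

# alpha = 1 makes glmnet a pure lasso; we tune only over lambda
lasso_grid <- expand.grid(alpha  = 1,
                          lambda = 10^seq(-4, 0, length.out = 20))

lasso_fit <- train(f, data = new_train, method = "glmnet",
                   family = "binomial",
                   trControl = tc, metric = "ROC",
                   tuneGrid = lasso_grid)

lasso_fit$bestTune    # the lambda value with the best cross-validated ROC
lasso_fit$finalModel  # the model refit on all of new_train at that lambda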

In summary, logistic regression has no hyperparameters; its estimates are found directly via maximum likelihood estimation. You still get cross-validation results, but they are for only one set of estimates, not for many different hyperparameter values (as there are none to choose from!).

For some background on the difference between hyperparameters and parameters see: https://datascience.stackexchange.com/questions/14187/what-is-the-difference-between-model-hyperparameters-and-model-parameters
