If there is one subject per row, then method = "LOOCV" would do it. Otherwise you will have to set up your own resampling indicators and supply them via the index argument of trainControl. At that point, the value of method doesn't matter.
You could do something like:
subs <- unique(dat$subject)
model_these <- vector(mode = "list", length = length(subs))
for(i in seq_along(subs))
  model_these[[i]] <- which(dat$subject != subs[i])
names(model_these) <- paste0("Subject", subs)
svmFit <- train(class ~ var1 + var2 + var3 + var4,
                data = dat,
                method = "svmRadial",
                preProc = c("center", "scale"),
                tuneGrid = MySVMTuneGrid,
                trControl = trainControl(method = "cv",
                                         index = model_these,
                                         classProbs = TRUE))
(Note that your test data set converts var1-var4 to character, so I didn't test this.)
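As an aside, recent versions of caret can build these leave-one-subject-out indices for you with groupKFold; a sketch, assuming the same dat$subject column as above:

```r
library(caret)

## groupKFold() returns a named list of training row indices; with k equal
## to the number of subjects, each element holds out exactly one subject,
## so the result can be passed straight to the 'index' argument.
subject_folds <- groupKFold(dat$subject, k = length(unique(dat$subject)))
ctrl <- trainControl(method = "cv",
                     index = subject_folds,
                     classProbs = TRUE)
```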
Max
train doesn't save the model information within a fold. You can save the models out to the file system using a custom model:
glmn_funcs <- getModelInfo("glmnet", regex = FALSE)[[1]]
glmn_funcs$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  theDots <- list(...)
  if(all(names(theDots) != "family")) theDots$family <- "multinomial"
  modelArgs <- c(list(x = as.matrix(x), y = y, alpha = param$alpha),
                 theDots)
  out <- do.call("glmnet", modelArgs)
  if(!is.na(param$lambda[1])) out$lambdaOpt <- param$lambda[1]
  save(out, file = paste("~/tmp/glmn", param$alpha,
                         floor(runif(1, 0, 1)*100), ## to help uniqueness
                         format(Sys.time(), "%H_%M_%S.RData"),
                         sep = "_"))
  out
}
model <- train(x = iris[,-5],
               y = iris$Species,
               method = glmn_funcs,
               type.gaussian = "naive",
               tuneGrid = grid,
               trControl = ctrl,
               preProc = c("center", "scale"))
You can use the coef function on each model to get the slopes. Note that train did not fit all possible models, which would be
> length(model$control$index)*nrow(grid)
[1] 5500
(omitting the one for the final model). It fits one per unique alpha per fold:
> length(unique(grid$.alpha))*length(model$control$index)
[1] 275
> length(list.files("~/tmp", pattern = "glmn_")) ##includes the final model
[1] 276
So you will have to do some looping using something like:
> params <- coef(out, s = unique(grid$.lambda), type = "nonzero")
> names(params) ## a matrix per class
[1] "setosa" "versicolor" "virginica"
> lapply(params, dim)
$setosa
[1] 5 20
$versicolor
[1] 5 20
$virginica
[1] 5 20
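One way to do that looping is to load each saved fit in turn and collect its coefficients; a sketch, assuming the file names produced by the custom fit function above:

```r
## Sketch: load each saved glmnet fit and collect its coefficients.
## Each file was written by save(out, ...) in the custom fit function,
## so load() restores an object named 'out' into the current environment.
fit_files <- list.files("~/tmp", pattern = "glmn_", full.names = TRUE)
all_coefs <- vector(mode = "list", length = length(fit_files))
for(i in seq_along(fit_files)) {
  load(fit_files[i])
  all_coefs[[i]] <- coef(out, s = unique(grid$.lambda))
}
names(all_coefs) <- basename(fit_files)
```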
Lastly, with recent versions of caret you don't need to prefix the parameter names with a period.
Max
Best Answer
Yes, that is correct. Logistic regression has no hyperparameters to tune over; the estimates for the coefficients will always be given by maximum likelihood. The repeated k-fold cross validation will do nothing to affect the estimates of the parameters/coefficients.
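You can see this for yourself by fitting both ways and comparing the coefficients; a sketch on simulated data (your variable names will differ):

```r
library(caret)

## Simulated binary-outcome data, purely for illustration.
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- factor(ifelse(dat$x1 + dat$x2 + rnorm(100) > 0, "yes", "no"))

## Plain maximum likelihood fit.
glm_fit <- glm(y ~ x1 + x2, data = dat, family = binomial)

## Same model via train with repeated 10-fold CV.
cv_fit <- train(y ~ x1 + x2, data = dat,
                method = "glm", family = binomial,
                trControl = trainControl(method = "repeatedcv",
                                         number = 10, repeats = 5))

## The coefficient estimates are identical; the resampling only
## adds performance estimates, it never changes the fit.
coef(glm_fit)
coef(cv_fit$finalModel)
```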
In terms of getting model estimates, none. It will give you the same results as glm. However, you can view the cross-validation results to get some idea of how the model might perform out of sample (on your test set) by looking at:

new_train$resample, which will give you the accuracy and kappa for each resample (so in your case 50 = folds * repeats = 10 * 5 accuracy and kappa statistics; note that accuracy might be a pretty misleading proxy for out-of-sample performance if you have unbalanced data).
new_train$results, which summarises the final results of your cross-validation (the averages taken over your resamples). This includes the average accuracy and kappa, as well as their standard deviations.

train is much more useful when the model you are training has hyperparameters that have to be chosen in order to get an estimate. Let's use lasso regression as an example. It is essentially OLS with a penalty on the coefficients to discourage overfitting; however, we need to choose HOW MUCH penalty to apply. A normal OLS just minimises the sum of squared errors, but the lasso adds a penalty on the absolute values of the beta coefficients, scaled by lambda: minimise sum_i (y_i - x_i'beta)^2 + lambda * sum_j |beta_j|. We can have train cross-validate over many different lambda values, and it will select the model with the lambda (penalty) that has the best cross-validation results, given by new_train$finalModel.

I'll note here that for the logistic regression you've estimated, new_train$finalModel isn't very meaningful, since there was only ever going to be one model, the same model given by glm. In summary, logistic regression has no hyperparameters; its estimates are found directly via maximum likelihood estimation. You still have cross-validation results, but they are over only one set of estimates, not over many different hyperparameter values (as there are none to choose from!).
For some background on the difference between hyperparameters and parameters see: https://datascience.stackexchange.com/questions/14187/what-is-the-difference-between-model-hyperparameters-and-model-parameters