Suppose we have a training data set and a test data set, and the outcome variable is binary. Is it usually necessary to split the training data set further so that there is a cross-validation data set? Or can you use the whole training data set to build a model and then use this model on the test data set? For logistic regression, for example, would cross-validation really help? If so, what type would be best?
Solved – Is cross validation needed
cross-validation, logistic
Related Solutions
Your current strategy will lead to overfitting. Note that dredge is essentially a form of best subsets selection. (The function name is rather evocative.) Such procedures are ill-advised in general (see my answer here: Algorithms for automatic model selection).
In addition to overfitting, cross-validating only the selected model will give you an over-optimistic estimate of the model's out-of-sample performance. Instead, you could include the entire model selection process in the cross-validation. For example, imagine you are doing 10-fold cross-validation. On the first iteration, you would use the first nine folds to fit the candidate models and select the best one; the selected model would then be applied to the tenth fold to assess its out-of-sample performance. Note that the model selected in this way may differ from one iteration to the next. This approach tells you the out-of-sample performance of the selection procedure, rather than the out-of-sample performance of one particular model that has already been selected.
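To make that concrete, here is a minimal sketch in base R of wrapping the whole selection step inside the cross-validation. The data frame `dat`, its binary 0/1 outcome `y`, the predictors `x1` and `x2`, and the use of AIC as the selection rule are all assumptions for illustration, not part of the original answer:

```r
# Sketch: include the *entire* model-selection step inside 10-fold CV.
# `dat`, `y`, `x1`, `x2`, and the AIC selection rule are placeholders.
set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))
candidates <- list(y ~ x1, y ~ x2, y ~ x1 + x2)  # candidate models

acc <- numeric(k)
for (i in 1:k) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  # model selection sees only the nine training folds
  fits <- lapply(candidates, glm, data = train, family = binomial)
  best <- fits[[which.min(vapply(fits, AIC, numeric(1)))]]
  # the selected model is then assessed on the held-out fold
  p <- predict(best, newdata = test, type = "response")
  acc[i] <- mean((p > 0.5) == (test$y == 1))
}
mean(acc)  # performance of the selection *procedure*, not of one model
```

Note that `best` may be a different formula in each iteration; that is exactly the point, since the estimate describes the procedure rather than a single pre-selected model.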
Regarding how to do this in R, there are a number of pre-existing functions and packages to help you with cross-validation. There is a helpful overview of several options here (pdf). You may also want to check out the caret package. To do some form of customized cross-validation, you may need to code it up yourself, though.
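As a hedged example of the caret route (assuming a data frame `dat` whose outcome `y` is a factor; caret fits a binomial glm automatically when the outcome is a factor):

```r
library(caret)

# 10-fold CV of a single logistic regression model;
# `dat` and its factor outcome `y` are assumed placeholders
ctrl <- trainControl(method = "cv", number = 10)
fit  <- train(y ~ ., data = dat, method = "glm", trControl = ctrl)
fit$results  # cross-validated Accuracy and Kappa
```

This cross-validates one fixed model; to cross-validate a selection among several models, you would still wrap the selection in your own outer loop, as in the sketch above.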
Nested cross-validation and repeated k-fold cross-validation have different aims. The aim of nested cross-validation is to eliminate the bias in the performance estimate due to the use of cross-validation to tune the hyper-parameters. As the "inner" cross-validation has been directly optimised to tune the hyper-parameters, it will give an optimistically biased estimate of generalisation performance. The aim of repeated k-fold cross-validation, on the other hand, is to reduce the variance of the performance estimate (to average out the random variation caused by partitioning the data into folds). If you want to reduce both bias and variance, there is no reason (other than computational expense) not to combine the two, such that repeated k-fold is used for the "outer" cross-validation of a nested cross-validation estimate. Using repeated k-fold cross-validation for the "inner" folds might also improve the hyper-parameter tuning.
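A sketch of that combination, assuming a predictor matrix `x` and a 0/1 outcome vector `y` (placeholders), with `cv.glmnet` standing in as the inner, hyper-parameter-tuning cross-validation:

```r
library(glmnet)
set.seed(1)
k <- 10; reps <- 5
acc <- matrix(NA, reps, k)
for (r in 1:reps) {
  # a fresh random partition for each repeat of the outer k-fold
  folds <- sample(rep(1:k, length.out = nrow(x)))
  for (i in 1:k) {
    tr <- folds != i
    # inner CV sees only the outer-training data and tunes lambda
    inner <- cv.glmnet(x[tr, ], y[tr], family = "binomial")
    # the tuned model is scored on the untouched outer fold
    p <- predict(inner, newx = x[!tr, ], s = "lambda.min", type = "class")
    acc[r, i] <- mean(p == y[!tr])
  }
}
mean(acc)  # averaging over reps * k outer folds reduces the variance
```

The outer repeats address variance; keeping the tuning strictly inside the outer-training data addresses the selection bias.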
If all of the models have only a small number of hyper-parameters (and they are not overly sensitive to the hyper-parameter values), then you can often get away with non-nested cross-validation to choose the model; you only need nested cross-validation if you also need an unbiased performance estimate. See:
Jacques Wainer and Gavin Cawley, "Nested cross-validation when selecting classifiers is overzealous for most practical applications", Expert Systems with Applications, Volume 182, 2021 (doi, pdf)
If, on the other hand, some models have more hyper-parameters than others, the model choice will be biased towards the models with the most hyper-parameters (which is probably a bad thing, as they are the ones most likely to suffer from over-fitting in model selection). See the comparison of RBF kernels, with a single hyper-parameter, and Automatic Relevance Determination (ARD) kernels, with one hyper-parameter for each attribute, in section 4.3 of my paper (with Mrs Marsupial):
GC Cawley and NLC Talbot, "On over-fitting in model selection and subsequent selection bias in performance evaluation", The Journal of Machine Learning Research 11, 2079-2107, 2010 (pdf)
The PRESS statistic (which is the inner cross-validation) will almost always select the ARD kernel, despite the RBF kernel giving better generalisation performance in the majority of cases (ten of the thirteen benchmark datasets).
Best Answer
Cross-validation has two purposes. First, when you don't use cross-validation and instead randomly select one part of the data for training and another for testing, you may get high accuracy on that particular split but much lower accuracy on a different random split. Methods like k-fold cross-validation average over many such splits, and so help you find the best-fitting model for your data, with the lowest error across all parts of it. Second, in some cases cross-validation helps you tune parameters of the model, such as the regularization strength (called C in some implementations) of logistic regression; you can find documentation about this in the MATLAB help center or in the R documentation files. So, as discussed, cross-validation plays a critical role in finding a reliable model for your data. You should select the cross-validation technique based on your model structure and your sample size: 5-fold cross-validation is a well-known technique, and you can increase k in k-fold cross-validation if you have a larger sample.
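In R, for instance, the tuning described above might look like the following sketch with glmnet, whose penalty `lambda` plays the role of (roughly the inverse of) the C parameter mentioned above; the predictor matrix `x` and binary outcome `y` are placeholders:

```r
library(glmnet)
set.seed(1)

# 5-fold CV over a grid of penalty strengths for logistic regression;
# `x` (matrix) and `y` (0/1 vector) are assumed placeholders
cvfit <- cv.glmnet(x, y, family = "binomial",
                   type.measure = "class",  # misclassification error
                   nfolds = 5)
cvfit$lambda.min  # penalty with the lowest cross-validated error
plot(cvfit)       # CV error curve across the whole lambda path
```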