I have a sample of 415 observations. With a sample of this size, is it possible to use cross-validation?
Solved – Cross validation and small samples
cross-validation, small-sample
Related Solutions
Theoretical considerations aside, the Akaike Information Criterion is just the likelihood penalized by the number of parameters. It follows that AIC accounts for uncertainty in the data (-2LL) and assumes that more parameters lead to a higher risk of overfitting (2k). Cross-validation just looks at the test-set performance of the model, with no further assumptions.
If you care mostly about making predictions and you can assume that the test set(s) would be reasonably similar to the real-world data, you should go for cross-validation. The possible problem is that when your data is small, splitting it leaves you with small training and test sets. Less data for training is bad, and a smaller test set makes the cross-validation results more uncertain (see Varoquaux, 2018). If your test sample is insufficient, you may be forced to use AIC, keeping in mind what it measures and what assumptions it makes.
On the other hand, as already mentioned in the comments, AIC offers asymptotic guarantees, which do not hold for small samples. Small samples may be misleading about the uncertainty in the data as well.
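To make the contrast concrete, here is a minimal sketch comparing the two criteria on simulated regression data; the dataset and candidate feature sets are made up for illustration, and statsmodels/scikit-learn are assumed:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 415
X = rng.normal(size=(n, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # only 2 informative features

for k in (1, 2, 5):  # candidate models using the first k features
    Xk = X[:, :k]
    # AIC: -2 log-likelihood + 2 * (number of parameters), from the fitted OLS
    aic = sm.OLS(y, sm.add_constant(Xk)).fit().aic
    # CV: mean held-out MSE over 10 folds, with no likelihood assumptions
    mse = -cross_val_score(LinearRegression(), Xk, y,
                           cv=10, scoring="neg_mean_squared_error").mean()
    print(f"k={k}: AIC={aic:.1f}, CV MSE={mse:.3f}")
```

Both criteria should favour the two-feature model here; the difference is that AIC gets there from the in-sample likelihood plus a penalty, while CV pays for it with data splitting.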
Should I divide it into a train and test set with, for example, 311 (75%) train observations and 104 (25%) test observations, and perform cross-validation on the train set?
Yes
Or should I perform the cross-validation on the entire data set?
No
The test set should be handled independently of the training set, so you could run a separate CV block on the test set if you really wish; it may provide some useful insight, but it is not universal practice. CV may be useful if you plan to apply the model to a completely new set of ‘real world’ data. Given that the test set is drawn from the same population as the training set, this may not be that useful: you would expect it to have similar characteristics to the training set if the split was performed correctly and without bias. Mind you, it may be worth checking this assumption.
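A minimal sketch of the workflow described above, with scikit-learn and a simulated dataset standing in for your data, and logistic regression as an arbitrary placeholder estimator:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=415, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)  # ~311 train / ~104 test

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5)  # CV on train set only
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Final model is fit on all training data and touches the test set exactly once
model.fit(X_train, y_train)
print("Test accuracy: %.3f" % model.score(X_test, y_test))
```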
What is the purpose of CV?
So long as the aim of performing cross-validation is to acquire a more robust estimate of the test MSE
This is not the purpose of CV; rather, it is to estimate the robustness of your performance metrics. As @user86895 states, it does not measure MSE (see Mean squared error versus Least squared error, which one to compare datasets? for further reading). CV creates multiple models on subsets of the data and applies each to the data withheld from that subset. It iterates over the dataset, building new models, until all observations have been included in training subsets and all have been included in test subsets. The final model is built on the whole training set, not on any of the individual CV-round models; the purpose of CV is not to build models but to assess the stability of the model's performance, i.e. how generalisable the model is.
When comparing different data processing or analysis algorithms on a dataset, it provides a first filter to identify the work pathways that produce the most stable models. It does this by estimating how variable the performance is between subsets of your training set. This allows you to detect models with a very high risk of overfitting and filter them out. Without cross-validation you would be picking based solely on maximum performance, without regard to its stability. But when you come to apply a model in a deployed situation, its stability (relevance across the real-world population) will be more important than moderate differences in raw performance on a subset of curated samples (i.e. your original experimental set).
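As a hedged sketch of that filtering idea: compare not just the mean but the spread of per-fold scores across candidate models. The two decision trees below are arbitrary choices on simulated data, purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=415, n_informative=5, random_state=0)

candidates = {
    "shallow tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "deep tree": DecisionTreeClassifier(max_depth=None, random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)
    # A large standard deviation across folds flags an unstable,
    # overfitting-prone model, even if its mean looks competitive
    print(f"{name}: mean={scores.mean():.3f}, sd={scores.std():.3f}")
```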
Cross-validation is in fact essential for choosing the most basic parameters of a model, such as the number of components in PCA or PLS, using the Q2 statistic (which is R2 but computed on the held-out data, see What is the Q² value for each component of a PCA) to determine when overfitting starts to degrade model performance.
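A minimal sketch of that component-selection loop for PLS, approximating Q2 as the cross-validated R2 on held-out folds; the data are simulated and the component range is arbitrary:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(415, 20))
y = X[:, :3].sum(axis=1) + rng.normal(size=415)  # 3 informative directions

for n_comp in range(1, 8):
    q2 = cross_val_score(PLSRegression(n_components=n_comp), X, y,
                         cv=10, scoring="r2").mean()
    # Q2 should rise with useful components, then plateau or fall
    # once extra components start fitting noise
    print(f"{n_comp} components: Q2 ~ {q2:.3f}")
```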
If I am mistaken, how could I use the cross-validation result to predict out of sample observations?
I am taking this to mean 'how can I use CV result to estimate performance beyond my experimental set?', but will update this section of my answer if it is clarified differently.
CV is used as a first-line estimate of model stability, not to estimate performance in real-world settings. The only way to do that is to test the final model in a real-world situation. What CV provides is a risk analysis: if the model appears stable, you could decide it is time to risk it on a real-world test. If it is not stable, you probably need to expand your training set considerably and build a new model, ensuring an even representation of important subgroups and confounding factors; these, alongside random noise, are a source of overfitting, since all relevant variation needs equal exposure to the model-building process to be properly weighted.
And a note on real-world validation: if it works, it doesn't prove your model is generalisable, only that it works under the specific mechanisms whereby it has been deployed in the real world.
Best Answer
Generally speaking, as you decrease the sample size, your cross-validation (CV) variance will increase. It also depends on the dimensionality of the dataset (i.e. whether you have many more variables than samples). If you have a very high-dimensional dataset, your variance may be higher still, and it may be best to pursue feature selection methods (another topic).
As for an absolute cutoff, there is no such thing. There was a study, admittedly slightly dated, on whether CV is appropriate for small sample sizes in microarray studies; the sample sizes were all less than 120. Its conclusions mirror what I mentioned above, with the additional point that bootstrap methods can be an alternative, at higher computational cost and with some increased bias.
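For intuition, here is a rough sketch of a plain out-of-bag bootstrap estimate (not the full .632+ correction) on a dataset of the size that study considered; the data and estimator are placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=120, random_state=0)
rng = np.random.default_rng(0)
oob_scores = []
for _ in range(200):  # 200 refits: the higher computational cost in action
    idx = rng.integers(0, len(y), len(y))       # resample with replacement
    oob = np.setdiff1d(np.arange(len(y)), idx)  # out-of-bag observations
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    oob_scores.append(model.score(X[oob], y[oob]))
print("bootstrap OOB accuracy: %.3f" % np.mean(oob_scores))
```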
Another point to consider is the class distribution of your data. Is your outcome binary or multiclass? Are your classes balanced? There are many methods to address the situation where one class is rarer than the other(s), such as stratified CV, among others. All in all, I suspect you shouldn't have a problem applying CV to 415 samples. Even with 10-fold CV you would have ~40 samples in each fold, which is far more than many published studies can boast (in the biological literature).
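A hedged sketch of stratified 10-fold CV at exactly this sample size, which keeps the class proportions roughly equal in every fold; the imbalanced dataset and estimator here are simulated stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# ~415 samples with a 4:1 class imbalance, as a stand-in for real data
X, y = make_classification(n_samples=415, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("per-fold accuracy:", scores.round(3))  # each fold holds ~41-42 samples
```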