Traditionally, building a model involves three stages, so you need to split your sample into three parts: a training set, a cross-validation set, and a test set.
Training (~60%)
In training, you simply estimate (fit) your model; you don't make any changes to it based on its results (accuracy, goodness of fit) on the training data, because doing so would risk overfitting the training set.
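For concreteness, here is a minimal scikit-learn sketch of the split and the training step; the synthetic data and the logistic regression are placeholders, not anything specific to your problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data; the split logic is the point, not the model or the data.
X, y = make_classification(n_samples=240, n_features=12, random_state=0)

# Carve off the 20% test set first, then split the rest 75/25 to get 60/20/20 overall.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Training: fit on the training set only; don't tune anything based on how well
# the model scores on this same data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```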
Cross validation (~20%)
After training your model, you can tune it - vary hyperparameters, remove features, or even select between different models - based on its performance on the cross validation set.
As an example, let's say you want to test which variables to include and which to leave out: you specify three different variable combinations (three different models), train all of them on your training set, then evaluate all of them on the cross-validation set and select the one that performs best there.
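A hedged sketch of that example (the feature-column indices, the synthetic data, and the logistic regression are arbitrary placeholders; the test set is omitted here for brevity):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=240, n_features=12, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.25, random_state=0)

# Three candidate variable combinations ("models"); the column indices are made up.
candidates = {
    "model_A": [0, 1, 2, 3],
    "model_B": [0, 1, 2, 3, 4, 5, 6, 7],
    "model_C": list(range(12)),          # all 12 variables
}

cv_scores = {}
for name, cols in candidates.items():
    fit = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
    cv_scores[name] = fit.score(X_cv[:, cols], y_cv)   # accuracy on the CV set

best = max(cv_scores, key=cv_scores.get)
print(cv_scores, "-> keep", best)
```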
K-fold CV
If you are interested in doing k-fold validation, you repeat exactly what's written above, with one major difference: instead of fixing a single 60%/20% split for your training and CV sets, you partition the data into K folds and run the training and validation procedure K times, each time holding out a different fold for cross validation and training on the rest. You then get a set of K results (accuracy, goodness of fit) that you can average to obtain a more robust estimate of your model's performance.
E.g., with 10-fold CV you'd run it 10 times; in each run one fold (10% of your data) serves as the cross-validation set and the remaining 90% as the training set.
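In scikit-learn terms, a minimal 10-fold sketch might look like this (placeholder data and model again): KFold partitions the data into 10 disjoint folds, and cross_val_score fits 10 times, each time scoring the held-out fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=240, n_features=12, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores.mean(), scores.std())   # average held-out accuracy and its spread
```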
Test set (~20%)
After tuning the model and/or selecting the best one, you can evaluate it on the test set. This is data the model has not seen yet, and you shouldn't make any changes to the model based on it. This is the very last stage of building the model: the test set is used only to evaluate your final model, never to tune it further (you don't want to overfit your test set).
If doing k-fold CV, you still have to set aside a test set that is separate from the training/CV data you are sampling the folds from.
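A small sketch of that arrangement, again with placeholders: the test set is split off first and touched only once at the very end, and the k-fold CV runs only on the remaining data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=240, n_features=12, random_state=0)

# Carve off the test set first; it is never used for tuning.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# All k-fold CV / tuning happens on the remaining 80%.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_dev, y_dev, cv=kf)

# Only once the model is final: refit on all training/CV data, evaluate once on the test set.
final_model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
print("CV accuracy:", cv_scores.mean(), " test accuracy:", final_model.score(X_test, y_test))
```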
Putting it all together
In your case, you have $N=240$ observations and $12$ variables. The first split of the data would be training/CV (70-80%) and test (20-30%), which for you means $168-192$ observations for training/CV and $48-72$ for test. Then, to select which variables to include, for each candidate model (combination of variables) do K-fold CV as follows:
- Split your training/CV set into K equal (random) subsets.
- Estimate your model K times, each time leaving out one of the K subsets.
- Cross-validate each estimate with the subset that was left out.
- Pool your cross-validation results across all the K estimates.
Then pick the model that performs best in CV (on average). Evaluate it on the test set. Don't change it any more.
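Putting those steps into one hedged sketch (the candidate variable combinations, the synthetic data, and the logistic regression are all placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Placeholder data with the dimensions from your question: N = 240, 12 variables.
X, y = make_classification(n_samples=240, n_features=12, random_state=0)

# First split: ~80% training/CV, ~20% test.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Candidate variable combinations (the column indices are made up).
candidates = {
    "model_A": [0, 1, 2, 3],
    "model_B": [0, 1, 2, 3, 4, 5, 6, 7],
    "model_C": list(range(12)),
}

# K-fold CV on the training/CV data for each candidate; pool (average) the fold scores.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
cv_means = {}
for name, cols in candidates.items():
    scores = cross_val_score(LogisticRegression(max_iter=1000), X_dev[:, cols], y_dev, cv=kf)
    cv_means[name] = scores.mean()

# Pick the best model by pooled CV performance, then evaluate it once on the test set.
best = candidates[max(cv_means, key=cv_means.get)]
final_model = LogisticRegression(max_iter=1000).fit(X_dev[:, best], y_dev)
print("test accuracy:", final_model.score(X_test[:, best], y_test))
```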
This is frequently referred to as repeated training-test split, leave-group-out cross-validation, or Monte Carlo cross-validation$^1$. Note that it should be done with more splits than regular CV (which would have e.g. 10 partitions), so your plan of doing it 1000 times is a good idea.
$^1$ See e.g. Applied Predictive Modeling by Kuhn and Johnson, Springer, 2013.
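For what it's worth, a minimal sketch of Monte Carlo CV with many random splits, here via scikit-learn's ShuffleSplit (placeholder data and model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=240, n_features=12, random_state=0)

# 1000 random 80/20 training/validation splits instead of 10 fixed partitions.
mc = ShuffleSplit(n_splits=1000, test_size=0.20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=mc)
print(scores.mean(), scores.std())
```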
Best Answer
Your experience that the model and its performance estimates depend on the particular split is a reason why cross validation is often preferred for all but extremely large data sets. Repeated cross validation or bootstrapping might be even better.
With the separate training/validation/test approach, your estimate of the generalizability of your model's performance on new data comes from the test set, which is set aside until initial training and tuning are done on the training and validation sets. Say that you want to estimate the error in model predictions made on your test set. You are then trying to estimate a variance, which can require a surprisingly large number of cases to estimate precisely.
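As a rough back-of-the-envelope illustration (my own simplification, assuming a plain misclassification rate rather than your actual error measure), the precision of a test-set estimate improves only with the square root of the number of test cases:

```python
import numpy as np

p = 0.20                                  # assumed true error rate (placeholder)
for n in [50, 200, 1000, 5000]:
    se = np.sqrt(p * (1 - p) / n)         # binomial standard error of the estimate
    print(f"n = {n:5d}: roughly {p:.2f} +/- {1.96 * se:.3f} (95% CI half-width)")
```

Even with 1000 test cases, an error rate around 20% is only pinned down to within about 2.5 percentage points.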
But if you set aside more cases for the test set so that you have a better measure of generalizability, you have fewer cases available for training and validation, which may limit your ability to generate a useful model in the first place.
The chapter on cross validation and related methods in The Elements of Statistical Learning (2nd edition, p. 222) makes the same point: a three-way split into training, validation, and test sets is the best approach when you are in a data-rich situation, while cross validation and the bootstrap are designed for situations where there is not enough data to split it three ways.
So cross validation is a useful approach in cases where you don't have a "large enough" data set to accomplish your goals. Your question suggests that you might be in such a situation despite having 5000 cases.
In practice, a single run of cross validation can give imprecise results. Frank Harrell recommends repeated runs of cross validation or, better, bootstrapping to take advantage of all the data most efficiently in building and evaluating a model. See for example this answer, with a link to further reading in a comment. His rms package provides tools for building, validating, and calibrating models.
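rms is an R package, so as a rough Python-side analogue only, here is a hedged sketch of repeated cross validation: the 10-fold partition is redrawn 50 times and the performance estimate is averaged over all 500 held-out folds (data and model are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

rkf = RepeatedKFold(n_splits=10, n_repeats=50, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rkf)
print(scores.mean(), scores.std())
```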