Solved – R/caret: train and test sets vs. cross-validation

caret, cross-validation, machine learning, r

This may be a silly question, but when generating a model with caret and using something like LOOCV or (even more to the point) LGOCV, what is the benefit of splitting data into train and test sets if this is essentially what the cross-validation step does anyway?

I read some of the related questions and they suggested that some of the cross-validation methods (e.g. what is described here at the caret site) are for the purpose of feature selection. But in my case, I'm using randomForest (method = "rf") and kernlab (method = "svmRadial"), which aren't listed in the group that attempts to purge predictors.

So, my question is if I use something like cross_val <- trainControl(method = "LGOCV", p = 0.8), isn't that the same as training on 80% of my data, testing the resultant model on the remaining 20%, and doing that over and over to get an idea of how well the model is working?

If so, is there any need to split my data into train/test sets?
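To make the comparison concrete, here's a rough sketch of what I mean (the data frame dat and outcome column y are just placeholders):

    library(caret)

    # Resampling only: LGOCV repeatedly trains on 80% of the rows and
    # evaluates on the held-out 20%, then aggregates those hold-out results.
    cross_val <- trainControl(method = "LGOCV", p = 0.8, number = 25)
    rf_fit <- train(y ~ ., data = dat, method = "rf", trControl = cross_val)

    # Explicit split: carve off 20% once as a separate test set.
    in_train <- createDataPartition(dat$y, p = 0.8, list = FALSE)
    training <- dat[in_train, ]
    testing  <- dat[-in_train, ]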

P.S. I partly ask because I'm building models on empirically generated DOE prototypes (think hard goods where we tweak inputs and then use test methods to measure various attributes of the prototype).

As such, I don't have a huge data set with lots of overlapping predictor levels to model from — we often run one trial at each DOE point of interest since data generation is expensive in this case. Thus, I'd like to use all the data I can for an accurate model, but wanted to check here that I'm not missing something obvious and making a poor model by not splitting things up.


Edit: In response to @topepo's question, I'm modeling physically measured attributes of a compound based on adjusting the chemical inputs of the formula. I can't discuss my actual application, but I'll make up an example based on formulating interior latex paint. I'm running designed experiments where we blend 4-5 chemicals, maybe vary the % solids, and vary the amount of time we heat the polymer solution to adjust the degree of polymerization.

We then might measure rheology, molecular weight, hardness of the paint coating, water resistance, etc.

We have decent replicates of several variables, but few true replicates in the sense that every DOE level was exactly the same. The total data set is ~80 observations, and maybe 4-5 are exact repeats. We've conducted 15 different tests, and perhaps 5-6 of them have been done for every single observation. Some of the responses are present for 25-50% of the data.

From here, we'd like to model the effects of our 7 predictors on the output properties and then optimize to target new design spaces that are most likely to give the desired properties.

(Hence my question HERE. Once I have a trained model, it would be nice to do the "reverse" and feed in desired responses to get the best guess at possible input levels to try next).

Best Answer

My general thoughts:

So when you are evaluating different models, you may want to tune them, try different types of pre-processing, etc., until you find what you think is a good model. Resampling can help guide you in the right direction during that process.
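For example (a sketch with placeholder data dat and outcome y), the tuning grid and any pre-processing are evaluated inside the resampling loop, so the resampled results are what you use to pick between candidates:

    library(caret)

    ctrl <- trainControl(method = "LGOCV", p = 0.8, number = 25)

    # Each candidate tuning value is fit on the 80% portions and scored on
    # the 20% hold-outs; centering/scaling is re-estimated within each resample.
    svm_fit <- train(y ~ ., data = dat, method = "svmRadial",
                     preProc = c("center", "scale"),
                     tuneLength = 8,
                     trControl = ctrl)
    svm_fit$bestTune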

However, there is still the chance of over-fitting, and the odds of this happening are greatly influenced by how much data (and how many predictors) you have. If you have a little bit of data, there are a few ways to think about this:

  • Use all the data for training since every data point adds significantly to how well the model does.
  • Set aside a small test set as a final check for gross errors due to over-fitting. The chance of over-fitting with a small sample size is not small and grows with the number of predictors relative to samples (a minimal sketch follows this list).
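A minimal sketch of the second option, assuming a data frame dat with outcome y (names are illustrative):

    library(caret)

    set.seed(123)
    # Hold back ~10% as a final sanity check; the rest goes to train().
    in_train <- createDataPartition(dat$y, p = 0.9, list = FALSE)
    training <- dat[in_train, ]
    testing  <- dat[-in_train, ]

    fit <- train(y ~ ., data = training, method = "rf",
                 trControl = trainControl(method = "LGOCV", p = 0.8))

    # Gross-error check: these numbers should be in the same ballpark as
    # the resampled estimates stored in fit$results.
    postResample(predict(fit, newdata = testing), testing$y)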

I fall into the second camp but the first isn't wrong at all.

If you have a ton of data then it doesn't really matter much (unless you have a small event rate).

For you:

You have a DOE. The type of design would help answer the question. Are you trying to interpolate between design points or predict design points that have not been tested so far?

You have one replicate. I feel like random forest is hitting a nail with a sledgehammer and might result in over-fitting. I would try something smoother like an SVM or (gasp) a neural network.
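One way to see whether the smoother model pays off (again just a sketch with placeholder names) is to fit both with the same resampling scheme and compare the hold-out performance:

    library(caret)

    ctrl <- trainControl(method = "LGOCV", p = 0.8, number = 25)

    rf_fit  <- train(y ~ ., data = dat, method = "rf", trControl = ctrl)
    svm_fit <- train(y ~ ., data = dat, method = "svmRadial",
                     preProc = c("center", "scale"), trControl = ctrl)

    # Side-by-side resampled RMSE and R^2 for the two models.
    summary(resamples(list(RF = rf_fit, SVM = svm_fit)))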

Max
