Cross-Validation vs. Bootstrap Resampling – Understanding the Difference Between Bootstrap and Resampling Techniques

bootstrap · cross-validation · resampling

I am working with biological / microarray data. For example, one of my datasets has 50 samples and 1000 gene attributes, with 2 labels, Normal and Disease. I usually use a machine learning method such as SVM to classify them. In my field, 10-fold cross-validation is used very frequently, but because the sample size is small it is hard to get robust results, so I usually test with resampling instead. For example, I randomly draw 100 subsets from the 50 samples, each containing 40 samples (80% of the 50), roughly as in the sketch below.
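To make the setup concrete, my resampling looks roughly like this (a minimal sketch with scikit-learn; the simulated data and the linear SVM are just placeholders for my actual dataset and classifier):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(50, 1000)               # 50 samples, 1000 gene attributes (simulated here)
y = np.array([0] * 25 + [1] * 25)     # Normal vs. Disease labels

# 100 random subsets: train on 40 samples (80%), test on the remaining 10
splitter = StratifiedShuffleSplit(n_splits=100, train_size=0.8, random_state=0)
accuracies = []
for train_idx, test_idx in splitter.split(X, y):
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    accuracies.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(np.mean(accuracies), np.std(accuracies))
```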

Recently, I learned about the bootstrap, which means sampling with replacement. In my resampling example, any two subsets will overlap in at least 30 samples (60% of the 50). So I think the bootstrap is the same concept as my resampling; is that the right understanding?

Also, I am using a moderate ratio (80%) for resampling; is there any rule for deciding it? There are rules like the .632+ rule for the bootstrap, but as far as I can tell they do not suggest a ratio for this kind of resampling.

Best Answer

The immediate answer is that these are two very different techniques with very different purposes. To make things clear, let me explain the techniques one by one and then discuss them in your context.

Resampling for Cross-Validation

When you use cross validation (CV), you are trying to make sure that your model is capturing the signal in your data, and not the noise -- in other words, the part that's similar among all the data (signal), and not the part that's unique to just the data you have (noise).

If you're capturing the signal, then you should be able to predict that signal in the hold-out data, even though you didn't use it in the original model, and that should give you good classification accuracy. If you're capturing the noise, your model will not be able to predict anything in the hold-out data, because the noise is unique to the data you used in the original model. As a result, you get bad classification accuracy.

The reason you HAVE to keep your training data and testing data separate is that if the same noise appears in both, you'll think you're predicting signal when in fact you're just using a unique data point to predict itself.
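As a minimal sketch of that point (scikit-learn; the simulated data and linear SVM are placeholders, not your setup): when each test fold is held out of the corresponding fit, a model trained on pure noise can only score around chance on the held-out fold, no matter how well it memorizes its training samples.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_noise = rng.randn(50, 1000)          # pure noise: there is no signal to capture
y = np.array([0] * 25 + [1] * 25)

# Each of the 10 test folds is excluded from the corresponding fit,
# so the noise the model memorizes cannot help it on the held-out fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="linear"), X_noise, y, cv=cv)
print(scores.mean())                   # roughly 0.5, i.e. chance level
```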

Resampling for a Bootstrap

In general, the bootstrap is not used for cross-validation (I have never seen it done). It is used to calculate confidence intervals. You sample from your original data with replacement, reestimate your quantities of interest with the resampled data, and then see how much those estimates vary -- that variance can be used to calculate a confidence interval.

Bootstrapping also involves resampling, but the reason it does it differently is that it has a different goal in mind. To quote from ttnphns's helpful comment, "bootstrapping simulates/approximates the asymptotic interval estimation (under assumption that the sample and the population distributions are isomorphic) of the infinite/large population and that the only sampling strategy which may correctly link a finite set with an infinite one is the replacement."

Meaning, loosely, if your data is representative of your underlying distribution, then you can use a bootstrap method to infer that underlying distribution, but doing that requires sampling with replacement.
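For concreteness, here is a minimal sketch of a percentile bootstrap confidence interval for a simple quantity (a sample mean, on made-up data); the resampling with replacement is the defining step:

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.normal(loc=10.0, scale=2.0, size=50)   # the observed sample (made up)

# Resample with replacement, re-estimate the mean each time,
# and use the spread of those estimates to form a confidence interval.
boot_means = []
for _ in range(2000):
    resample = rng.choice(data, size=len(data), replace=True)
    boot_means.append(resample.mean())

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% percentile bootstrap CI for the mean: ({lo:.2f}, {hi:.2f})")
```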

Comparing them in this context

You mention "your sampling" a lot without it being clear what you're referring to, so first, it's important to make sure that, however you're choosing your train and test sets (for a single fold), there is NO overlap between them.

Second, "bootstrap" is not the same as "resampling," or "sampling with replacement," in general. Many things have sampling with replacement. Just because there is sampling with replacement does not mean something is a bootstrap.

That said, there IS something you can do in this context that might reasonably be called a "bootstrap," and it involves your prediction accuracy. You never actually mention your prediction accuracy, so I'm partly guessing that this is what you mean, but it's not a bad idea either way. Here is what it is and why I think it works, whether or not it really is a "bootstrap."

You say your "results" are not robust because you don't have much data. By "results," I'm going to assume you mean "cross-validation accuracy." And if there's a lot of variance in your calculated CV-accuracy, then I see why only using a small number of folds isn't going to be enough -- you want a clearer picture of the full distribution of what your hold-out accuracy might be. And to do this, you're randomly choosing 40 data points, and predicting on 10 -- randomly choosing 40 new data points, and predicting on the remaining 10. Each time you get a new accuracy.

I think it is okay to say that you are using a bootstrap-like method to find a confidence interval for your cross-validation accuracy (though I'm not entirely certain it qualifies). You are randomly sampling points from a "cross-validation space." The space is complicated, and I'm not sure what implications that has -- but you are sampling points in it at random, and I think the distribution you get back, of possible accuracies, is meaningful. (ASSUMING, always, that your 50 data points really are representative of your underlying distribution. How good an assumption THAT is, I have NO idea -- that's your job.) Ultimately, this is not very different from $n$-fold CV, where a single point appears in the training set in $n-1$ of the $n$ folds. You're just taking a random subset of those possible splits, and as long as it's random, I think you're okay.

(And if you're worried about duplicates, you could also cache (save) the indices of the samples you've trained on, so that you never repeat exactly the same split. But you should be safe -- 50 choose 10 is about ten billion possible splits.)
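Putting that together, here is a sketch of the idea (assuming the same kind of 50-sample, 2-class setup as in the question, with simulated data and a linear SVM as placeholders): repeat random 40/10 splits, collect the hold-out accuracies, and summarize them with a percentile interval.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = rng.randn(50, 1000)               # stand-in for the 50-sample microarray data
y = np.array([0] * 25 + [1] * 25)

# Repeated random 40/10 splits: each repetition yields one hold-out accuracy.
splitter = StratifiedShuffleSplit(n_splits=200, train_size=40, random_state=1)
accs = []
for train_idx, test_idx in splitter.split(X, y):
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    accs.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

accs = np.array(accs)
print("mean hold-out accuracy:", accs.mean())
print("2.5%-97.5% percentile interval:", np.percentile(accs, [2.5, 97.5]))
```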

Last, re: choosing a split (like 80%), there is no solid rule. But the whole point is to make sure you're not over-fitting, and because cross-validation can help test for over-fitting, you can try a number of different splits (between 50% and 90%, say) and see which gives the best testing accuracy, as in the sketch below. (Just make sure you're not over-fitting to this number, too!) Here, because you want a lot of points, I would also run leave-one-out cross-validation (LOOCV) and compare it to the 80% split you're using now.
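If you want to compare split ratios empirically, a rough sketch (again with simulated placeholder data and a linear SVM; the candidate fractions and repeat counts are arbitrary choices) might look like this:

```python
import numpy as np
from sklearn.model_selection import (LeaveOneOut, StratifiedShuffleSplit,
                                     cross_val_score)
from sklearn.svm import SVC

rng = np.random.RandomState(2)
X = rng.randn(50, 1000)               # simulated placeholder data
y = np.array([0] * 25 + [1] * 25)

# Try several train fractions and see how the hold-out accuracy behaves.
for frac in [0.5, 0.6, 0.7, 0.8, 0.9]:
    cv = StratifiedShuffleSplit(n_splits=100, train_size=frac, random_state=2)
    scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv)
    print(f"train fraction {frac:.0%}: mean accuracy {scores.mean():.3f}")

# Leave-one-out CV for comparison (50 fits, each tested on a single sample).
loo_scores = cross_val_score(SVC(kernel="linear"), X, y, cv=LeaveOneOut())
print(f"LOOCV: mean accuracy {loo_scores.mean():.3f}")
```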
