Solved – Calculate the MSE for a Linear Regression Model using a Bootstrap

Tags: bootstrap, mse, regression

I'm currently reading the book, An Introduction to Statistical Learning, and I'm struggling a little with the bootstrap approach. As far as I understand, I can use a bootstrap in almost all situations to obtain a standard error for a particular statistic. Does it make sense to use a bootstrap when computing the MSE for a linear regression model? If yes, do I resample both the training data and the test data, or do I train the model once and then draw different test sets? In the latter case, do I draw the test data from the same collection of data as the training data, or should I always keep my test data separate?

In other words, does the following R code make sense?

MSE <- function(model, data) { ... }

boot.mse <- function(object, data, index) {
  train <- head(index, ceiling(length(index) * 0.9))  # first 90% of the resampled indices
  test <- tail(index, floor(length(index) * 0.1))     # remaining 10%
  MSE(lm(object, data[train, ]), data[test, ])        # calculate test MSE
}

boot(my_data, boot.mse, 1000, object = some_model_or_formula)

Best Answer

  • I don't use R enough to comment on the code.
  • The rms package has the validate.ols function that will perform the bootstrap for you. You can compare results.
  • The bootstrap in this setting is used to validate the model-building process - to assess overfitting and to provide a valid estimate of out-of-sample performance.
  • The bootstrap is more efficient than split-sample validation, which provides reliable/stable estimates only with fairly large sample sizes. If you're using the bootstrap, you're not using separate test/training sets.
  • Traditionally one builds the model on the bootstrap sample and tests it on the original sample.
  • Comparing internal validation methods: http://www.ncbi.nlm.nih.gov/pubmed/11470385
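To make the "build on the bootstrap sample, test on the original sample" idea concrete, here is a minimal base-R sketch. The simulated data set and the formula `y ~ x` are assumptions for illustration only; it averages the original-sample MSE of models fit on many bootstrap resamples, rather than splitting each resample into train/test halves as in the question's code.

```r
set.seed(1)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n)  # made-up linear relationship for the demo

# Mean squared prediction error of a fitted model on a data set
mse <- function(fit, data) mean((data$y - predict(fit, data))^2)

# Apparent (training-sample) MSE, which is optimistic
apparent <- mse(lm(y ~ x, d), d)

B <- 200
boot_mse <- replicate(B, {
  idx <- sample(n, replace = TRUE)  # bootstrap sample of row indices
  fit <- lm(y ~ x, d[idx, ])        # build the model on the bootstrap sample
  mse(fit, d)                       # ...and test it on the original sample
})

mean(boot_mse)  # bootstrap estimate of out-of-sample MSE
```

The averaged bootstrap MSE will typically sit slightly above the apparent MSE, reflecting the optimism of evaluating a model on its own training data.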