Solved – Calculate the MSE for a Linear Regression Model using a Bootstrap

Tags: bootstrap, mse, regression

I'm currently reading the book, An Introduction to Statistical Learning, and I'm struggling a little with the bootstrap approach. As far as I understand, I can use a bootstrap in almost all situations to obtain a standard error for a particular statistic. Does it make sense to use a bootstrap when computing the MSE for a linear regression model? If yes, do I resample both the training data and the test data, or do I train the model once and then draw different test sets? In the latter case, do I draw the test data from the same collection of data as the training data, or should I always keep my test data separate?

In other words, does the following R code make sense?

MSE <- function(model, data) { ... }

boot.mse <- function(object, data, index) {
  train <- head(index, ceiling(length(index) * 0.9))  # first 90% of the resampled indices
  test <- tail(index, floor(length(index) * 0.1))     # remaining 10%
  MSE(lm(object, data[train, ]), data[test, ])        # calculate test MSE
}

boot(my_data, boot.mse, 1000, object = some_model_or_formula)

Best Answer

  • I don't use R enough to comment on the code.
  • The rms package has the validate.ols function that will perform the bootstrap for you. You can compare results.
  • The bootstrap in this setting is used to validate the model-building process - to assess overfitting and to provide a valid estimate of out-of-sample performance.
  • The bootstrap is more efficient than split-sample validation, which provides reliable/stable estimates only with fairly large sample sizes. If you're using the bootstrap, you're not using separate test/training sets.
  • Traditionally one builds the model on the bootstrap sample and tests it on the original sample.
  • Comparing internal validation methods: http://www.ncbi.nlm.nih.gov/pubmed/11470385
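To make the "build on the bootstrap sample, test on the original sample" idea concrete, here is a minimal base-R sketch. The simulated data set and the formula `y ~ x` are assumptions for illustration only; it averages the original-sample MSE of models fit on many bootstrap resamples, rather than splitting each resample into train/test halves as in the question's code.

```r
set.seed(1)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n)  # made-up linear relationship for the demo

# Mean squared prediction error of a fitted model on a data set
mse <- function(fit, data) mean((data$y - predict(fit, data))^2)

# Apparent (training-sample) MSE, which is optimistic
apparent <- mse(lm(y ~ x, d), d)

B <- 200
boot_mse <- replicate(B, {
  idx <- sample(n, replace = TRUE)  # bootstrap sample of row indices
  fit <- lm(y ~ x, d[idx, ])        # build the model on the bootstrap sample
  mse(fit, d)                       # ...and test it on the original sample
})

mean(boot_mse)  # bootstrap estimate of out-of-sample MSE
```

The averaged bootstrap MSE will typically sit slightly above the apparent MSE, reflecting the optimism of evaluating a model on its own training data.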