Solved – Interpreting Cross Validation with Multiple Linear Regression

Tags: cross-validation, linear, multiple regression, r, regression

I am using R to develop a multiple linear regression model for some data I have. I do not have many data points (about 30; the data are very hard to collect) and am trying different regression models. With a very large regression model (see below) I get an R² of 0.974 and an adjusted R² of 0.965, with an RMSE of 0.339. With 5-fold cross validation, however, the RMSE is 0.584; 10-fold and 2-fold cross validation give similarly larger RMSE values.

How do I interpret this? Does it mean the model is overfitting? Should I aim for a cross-validation RMSE roughly equal to the full-model RMSE?

library(DAAG)  # provides CVlm() for k-fold cross validation

mobig.fit <- lm(y ~ x1 + x2 + x3 + x1:x2 + x2:x3 + x1:x3 + x1:x2:x3, data = datas)

# Print R-squared values
summary(mobig.fit)$r.squared
summary(mobig.fit)$adj.r.squared

# Get training RMSE (residuals come from the fit on the full dataset)
rss.mobig <- c(crossprod(mobig.fit$residuals))
mse.mobig <- rss.mobig / length(mobig.fit$residuals)
rmse.mobig <- sqrt(mse.mobig)

# Cross validate: the "ms" attribute holds the overall cross-validation mean square
cv.mobig <- CVlm(datas, mobig.fit, m = 5, plotit = FALSE)
cv.mobig.rmse <- sqrt(attr(cv.mobig, "ms"))
cat("RMSE for full model:", rmse.mobig, "\n")
cat("RMSE for CV:", cv.mobig.rmse, "\n")

Best Answer

It looks like your "full model" is trained on the whole dataset, so your 0.339 RMSE is the error on the training set. Training error is almost always lower than the error from a train/test split or cross validation, because in those procedures the model is evaluated on data it was not fitted to. A much higher error on the test/CV set can indicate overfitting, but you will virtually always see at least somewhat worse performance on held-out data.

The key point is that your "full model" RMSE is an optimistically biased estimate of generalization error. Intuitively, a sufficiently flexible model can simply "memorize" the training data and spit out near-perfect predictions for those cases, yet generalize poorly to cases it has not seen (i.e. a held-out test set).
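To see this concretely, here is a minimal sketch with simulated data (all names and the data-generating process here are hypothetical, not your actual dataset): it fits the same interaction-heavy model on 30 points, computes the training RMSE, and then computes a manual 5-fold cross-validation RMSE in base R. The training RMSE will typically come out noticeably lower.

set.seed(1)

# Simulate a small dataset like the one described: n = 30, three predictors
n <- 30
datas <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
datas$y <- 1 + datas$x1 + 0.5 * datas$x2 - 0.5 * datas$x3 + rnorm(n, sd = 0.5)

form <- y ~ x1 + x2 + x3 + x1:x2 + x2:x3 + x1:x3 + x1:x2:x3

# Training RMSE: residuals from a fit on all 30 points
fit <- lm(form, data = datas)
rmse.train <- sqrt(mean(residuals(fit)^2))

# Manual 5-fold CV: each fold is predicted by a model that never saw it
k <- 5
folds <- sample(rep(1:k, length.out = n))
cv.err <- numeric(n)
for (i in 1:k) {
  fit.i <- lm(form, data = datas[folds != i, ])
  pred.i <- predict(fit.i, newdata = datas[folds == i, ])
  cv.err[folds == i] <- datas$y[folds == i] - pred.i
}
rmse.cv <- sqrt(mean(cv.err^2))

cat("Training RMSE:", rmse.train, "\n")
cat("5-fold CV RMSE:", rmse.cv, "\n")  # typically larger than the training RMSE

Because the true signal in this simulation is purely linear, the interaction terms mostly fit noise, so the gap between the two numbers illustrates exactly the optimism described above.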
