I am fitting a multiple linear regression on a data set with 72 variables and using 5-fold cross-validation to evaluate the model.
I am unsure which values I need to look at to validate the model. Is it the averaged R squared value of the 5 models compared to the R squared value of the original data set? In my understanding, the average R squared value of the sampled data needs to be within 2% of the R squared value of the original data set. Is that right? Or are there other results I should be looking at?
Best Answer
It is neither of them. For each fold, calculate the mean squared error of the predictions and the variance of the observed outcome, and use the formula $R^2 = 1 - \frac{\mathbb{E}\left[(y - \hat{y})^2\right]}{\mathbb{V}(y)}$ to get an out-of-sample $R^2$ for that fold. Then report the mean and standard error of the out-of-sample $R^2$ across the folds.
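For instance, here is a minimal R sketch of this procedure, assuming a data frame `df` whose outcome column is named `y` and whose remaining columns are the predictors (both names are placeholders):

```r
set.seed(1)                            # for reproducible fold assignment
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))   # shuffled fold IDs

r2 <- sapply(1:k, function(i) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit   <- lm(y ~ ., data = train)
  pred  <- predict(fit, newdata = test)
  mse   <- mean((test$y - pred)^2)
  1 - mse / var(test$y)                # R^2 = 1 - MSE / Var(y) on the held-out fold
})

mean(r2)                               # average out-of-sample R^2
sd(r2) / sqrt(k)                       # standard error across the folds
```

Note that `var()` uses the sample ($n - 1$) variance rather than the population variance in the formula; with reasonably sized folds the difference is negligible for this purpose.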
Please also have a look at this discussion. There are lots of examples on the web, specifically R code, where $R^2$ is calculated by stacking together the results of the cross-validation folds and reporting the $R^2$ between this chimeric vector and the observed outcome variable $y$. However, the answers and comments in the discussion above, as well as this paper by Kvålseth, which predates the wide adoption of cross-validation, strongly recommend using the formula $R^2 = 1 - \frac{\mathbb{E}\left[(y - \hat{y})^2\right]}{\mathbb{V}(y)}$ in the general case.

There are several things that can go wrong with the practice of (1) stacking and (2) correlating predictions.
1. Consider observed values of $y$ in the test set, `c(1, 2, 3, 4)`, and the predictions `c(8, 6, 4, 2)`. The prediction is clearly anti-correlated with the observed values, yet you would be reporting a perfect correlation, $R^2 = 1.0$ (see the first sketch after this list).
2. Consider a predictor that simply returns the replicated mean of the train points of $y$. Now imagine that you sorted $y$ before splitting into cross-validation (CV) folds and split without shuffling; e.g., in 4-fold CV on 16 samples, the fold ID labels of the sorted $y$ would be `1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4`. When you split the sorted $y$ points this way, the mean of the train set anti-correlates with the mean of the test set, so you get a low negative Pearson $R$. But if you then calculate a stacked $R^2$, you get a pretty high value, even though your predictors are just noise and the prediction is nothing but the mean of the seen $y$. See the figure below for 10-fold CV, and the second sketch below for a numeric demonstration.
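To make point 1 concrete, a tiny R sketch with those exact vectors:

```r
y    <- c(1, 2, 3, 4)   # observed test-set values
pred <- c(8, 6, 4, 2)   # predictions, moving exactly opposite to y

cor(y, pred)                      # -1: perfectly anti-correlated
cor(y, pred)^2                    # 1.0: squared correlation claims a "perfect" fit
1 - mean((y - pred)^2) / var(y)   # -9.5: the MSE-based R^2 correctly flags a terrible fit
```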
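And a sketch of point 2, assuming pure-noise data, 16 samples, and 4 non-shuffled folds (all of these values are placeholders):

```r
set.seed(1)
n <- 16; k <- 4
y     <- sort(rnorm(n))            # sorted outcome; there are no real predictors at all
folds <- rep(1:k, each = n / k)    # non-shuffled fold IDs: 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4

pred <- numeric(n)
for (i in 1:k) {
  pred[folds == i] <- mean(y[folds != i])   # each fold is predicted by its train-set mean
}

cor(y, pred)                       # negative: train and test means anti-correlate
cor(y, pred)^2                     # high "stacked" R^2 despite a useless predictor
1 - mean((y - pred)^2) / var(y)    # at or below zero, as it should be for this predictor
```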