Solved – How to compute R-squared value when doing cross-validation

cross-validation, multiple-regression, r-squared, regression

I am using multiple linear regression with a data set of 72 variables, and I am using 5-fold cross-validation to evaluate the model.

I am unsure which values I need to look at to assess the validity of the model. Should I compare the average R-squared value across the 5 folds to the R-squared value on the original data set? My understanding is that the average R-squared value from the resampled data needs to be within 2% of the R-squared value on the original data set. Is that right? Or are there other results I should be looking at?

Best Answer

It is neither of them. For each fold, calculate the mean squared error of the predictions and the variance of the observed outcomes, and use the formula $R^2 = 1 - \frac{\mathbb{E}(y - \hat{y})^2}{\mathbb{V}(y)}$ to get an out-of-sample $R^2$. Then report the mean and standard error of these out-of-sample $R^2$ values across the folds.
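As a minimal sketch of this procedure (in Python with NumPy rather than R; the data, sample sizes, and OLS-via-`lstsq` fit are illustrative assumptions, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples, 5 predictors, linear signal plus noise
# (purely illustrative; the question's real data has 72 variables).
n, p, k = 100, 5, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(scale=0.5, size=n)

# Shuffle once, then split the indices into k folds.
idx = rng.permutation(n)
folds = np.array_split(idx, k)

r2_per_fold = []
for i in range(k):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    # Ordinary least squares fit on the training folds (with intercept).
    A = np.column_stack([np.ones(len(train)), X[train]])
    coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    y_hat = np.column_stack([np.ones(len(test)), X[test]]) @ coef
    # R^2 = 1 - E[(y - y_hat)^2] / Var(y), both taken over the test fold.
    mse = np.mean((y[test] - y_hat) ** 2)
    r2_per_fold.append(1 - mse / np.var(y[test]))

r2_per_fold = np.array(r2_per_fold)
print("per-fold R^2:", np.round(r2_per_fold, 3))
print("mean:", r2_per_fold.mean(),
      "SE:", r2_per_fold.std(ddof=1) / np.sqrt(k))
```

The key point is that each fold contributes its own $R^2$, computed entirely on held-out data, and only then are the fold values summarized.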

Please also have a look at this discussion. There are lots of examples on the web, R code in particular, where $R^2$ is calculated by stacking together the results of the cross-validation folds and reporting the $R^2$ between this chimeric prediction vector and the observed outcome variable $y$. However, the answers and comments in the discussion above, as well as this paper by Kvålseth, which predates the wide adoption of cross-validation, strongly recommend using the formula $R^2 = 1 - \frac{\mathbb{E}(y - \hat{y})^2}{\mathbb{V}(y)}$ in the general case.

Several things can go wrong with the practice of (1) stacking predictions and then (2) correlating them with the observed outcomes.

1. Consider observed values of $y$ in the test set, c(1, 2, 3, 4), and predictions c(8, 6, 4, 2). The predictions are clearly anti-correlated with the observed values, yet the squared correlation reports a perfect $R^2 = 1.0$.

2. Consider a predictor that simply returns the mean of the training-set values of $y$, replicated to the length of the test set. Now imagine that you sorted $y$ before splitting it into cross-validation (CV) folds and then split without shuffling; e.g. in 4-fold CV on 16 samples, the sorted $y$ gets the following fold ID labels:

foldid = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4)
y = c(0.09, 0.2, 0.22, 0.24, 0.34, 0.42, 0.44, 0.45, 0.45, 0.47, 0.55, 0.63, 0.78, 0.85, 0.92, 1)

When you split the sorted $y$ this way, the mean of the training set anti-correlates with the mean of the test set, so the stacked predictions have a strongly negative Pearson $r$ with $y$. Squaring it into a stacked $R^2$ then gives a fairly high value, even though the predictors are pure noise and the prediction is just the mean of the seen $y$. See the figure below for a 10-fold CV simulation:
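Both failure modes can be reproduced in a short script using the 16-sample example above (a sketch in Python with NumPy, although the snippets in the text are R; the variable names are mine):

```python
import numpy as np

# The sorted outcomes and fold labels from the example above.
y = np.array([0.09, 0.2, 0.22, 0.24, 0.34, 0.42, 0.44, 0.45,
              0.45, 0.47, 0.55, 0.63, 0.78, 0.85, 0.92, 1.0])
foldid = np.repeat([1, 2, 3, 4], 4)

# Failure mode 1: squared correlation ignores the sign.
obs = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([8.0, 6.0, 4.0, 2.0])
r = np.corrcoef(obs, pred)[0, 1]
print("Pearson r:", r)            # exactly -1: perfectly anti-correlated
print("squared-correlation 'R^2':", r ** 2)  # reports a perfect 1.0

# Failure mode 2: mean-only predictor on sorted, unshuffled folds.
stacked_pred = np.empty_like(y)
r2_formula = []
for k in (1, 2, 3, 4):
    test = foldid == k
    train_mean = y[~test].mean()   # the "prediction" is just the train mean
    stacked_pred[test] = train_mean
    mse = np.mean((y[test] - train_mean) ** 2)
    r2_formula.append(1 - mse / np.var(y[test]))

r_stacked = np.corrcoef(y, stacked_pred)[0, 1]
print("stacked Pearson r:", r_stacked)          # strongly negative
print("stacked squared correlation:", r_stacked ** 2)  # deceptively high
print("per-fold formula R^2:", np.round(r2_formula, 2))  # all negative
```

The formula-based per-fold $R^2$ values are all negative, correctly flagging that the mean-only predictor has no out-of-sample skill, while the stacked squared correlation looks impressive.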