Solved – Appropriate way to calculate cross-validated R square

cross-validationrr-squared

I have a dataset that I divide into two equal partitions A and B.

I estimate a regression model on partition A.

I want to calculate the cross-validated $R^2$ when predicting the values in partition B.

I would like to know if the following approach is correct and also what other ways there could be:

#generate data:

data <- replicate(10, rnorm(100))
data <- as.data.frame(data)

#divide into training and test set:

train <- data[1:50,]
test <- data[51:100,]

#fit model and get predictions for unseen data:

model <- lm(train[,1] ~., data = train)
predictions <- predict(model, test)

#obtain cross-validated R squared:

cor(predictions,test[,1])^2

Best Answer

That is incorrect because it allows for recalibration of predictions with a new overall slope and intercept. Use this formula after freezing all coefficients: 1 - (sum of squared errors) / (sum of squares total). The denominator is $(n-1)\times$ the observed variance of $Y$ in the holdout sample. When you do it correctly you can get negative $R^2$ in some holdout samples when the real $R^2$ is low.

What is the purpose of CV?

So long as the aim of performing cross-validation is to acquire a more robust estimate of the test MSE

This is not the purpose of CV, rather it is to estimate the robustness of your performance metrics. As @user86895 states it does not measure MSE, see Mean squared error versus Least squared error, which one to compare datasets? for further reading. CV creates multiple models on subsets of the data and applies them to the data withheld from that subset. It iterates over the dataset, building new models until all have been included in training subsets and all have been included in test subsets. The final model is built on all the training set not any of the individual CV round models, the purpose of CV is not to build models but to assess stability of the model performance, i.e. how generalisable the model is.

When comparing different data processing or analysis algorithms on a dataset it provides a first filter to identify the work pathways that provide the most stable models. It does this by providing estimates of how variable the performance is between sub-sets of your training set. This allows you to detect models with a very high risk of overfitting and filter them out. Without cross validation you would be picking based solely on the maximum performance without concern to its stability. But when you come to apply a model in a deployed situation its stability (relevance across the real world population) will be more important than moderate differences in raw performance on a subset of curated samples (i.e you original experimental set).

Cross validation is in fact essential for choosing the crudest parameters for a model such as number of components in PCA or PLS using the Q2 statistic (which is R2 but on the held out data, see What is the Q² value for each component of a PCA) to determine when overfitting starts to degrade model performance.

If I am mistaken, how could I use the cross-validation result to predict out of sample observations?

I am taking this to mean 'how can I use CV result to estimate performance beyond my experimental set?', but will update this section of my answer if it is clarified differently.

CV is used as a first line estimate of model stability, not to estimate performance in real world settings. The only way to do this is to test the final model in a real-world situation. What CV does is provide you a risk analysis, if it appears stable then you could decide it is time to risk the model on a real-world test. If it is not stable then you need to probably expand your training set considerably (ensuring an even representation of important sub groups and confounding factors as these are one source, other than random noise, of overfitting as all relevant variation needs to be given an equal exposure to the model building process to be properly weighted for) and build a new model.

And a note on real world validation, if it works it doesn’t prove your model is generalisable, only that it works under the specific mechnisms whereby it has been deployed in the real-world.

Solved – Appropriate way to get Cross Validated AUC

As Provost explains in 'An Introduction to ROC Analysis', ROC averaging can be simply done by combining the scores from multiple sets $T_1, ..., T_k$ as you suggested in method (2). This is preferred to method (1) since it can be quite hard to average actual ROC curves, since the specificity (x-axis) values of the points are expected to be different. Therefore, you would need to do a lot of interpolation to average the curves. Another advantage is that the resulting curve from method (2) is smoother and approximates the AUC better, as a low number of scores tends to underestimate the AUROC (at least when calculated via the trapezoidal rule).

However, one should note that an advantage of method (1) is that it enables you to estimate the variance of the AUC.

Best Answer

Related Solutions

Solved – Perform cross-validation on train set or entire data set

What is the purpose of CV?

Solved – Appropriate way to get Cross Validated AUC

Related Question