Solved – Appropriate way to calculate cross-validated R square

Tags: cross-validation, r, r-squared

I have a dataset that I divide into two equal partitions A and B.

I estimate a regression model on partition A.

I want to calculate the cross-validated $R^2$ when predicting the values in partition B.

I would like to know whether the following approach is correct, and what alternatives there are:

#generate data:

data <- replicate(10, rnorm(100))
data <- as.data.frame(data)

#divide into training and test set:

train <- data[1:50,]
test <- data[51:100,]

#fit model and get predictions for unseen data:

model <- lm(V1 ~ ., data = train) #name the response so "." excludes it from the predictors
predictions <- predict(model, test)

#obtain cross-validated R squared:

cor(predictions,test[,1])^2

Best Answer

That is incorrect, because squaring the correlation implicitly recalibrates the predictions with a new overall slope and intercept. Instead, freeze all coefficients at their training-set values and use $R^2 = 1 - \text{SSE}/\text{SST}$, where SSE is the sum of squared prediction errors in the holdout sample and SST is $(n-1)\times$ the observed variance of $Y$ in the holdout sample. Done correctly, $R^2$ can be negative in some holdout samples when the true $R^2$ is low, because frozen predictions can fit worse than the holdout mean.
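A minimal sketch of that calculation, reusing the `predictions` and `test` objects from the question's code (so the coefficients stay frozen at their training-set values):

```r
#holdout R squared with frozen coefficients:

sse <- sum((test[,1] - predictions)^2)    #sum of squared prediction errors
sst <- sum((test[,1] - mean(test[,1]))^2) #= (n-1) * var(Y) in the holdout sample
1 - sse/sst                               #holdout R^2; can be negative
```

Note that the denominator uses the holdout mean of $Y$, not the training mean, and that no new slope or intercept is estimated on the test data.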
