Regression – Interpreting $R^2$ – Coefficient of Determination for Test Data

Tags: r-squared, regression

When calculating the $R^2$ value for the coefficient of determination of a linear regression model, it is well known (Wikipedia) that

$SS_{Total} = SS_{Explained} + SS_{Residual}$ (1)

i.e.,
$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2$

and $R^2 = 1 - \frac{SS_{Residual}}{SS_{Total}}$.
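
For concreteness, (1) can be checked numerically on a toy OLS fit with an intercept (a minimal sketch with made-up data and variable names of my own):
set.seed(1)
x <- runif(50)
y <- 1 + 2*x + rnorm(50)
fit <- lm(y ~ x)                       # OLS with an intercept
yhat <- fitted(fit)
ss_total     <- sum((y - mean(y))^2)
ss_explained <- sum((yhat - mean(y))^2)
ss_residual  <- sum((y - yhat)^2)
ss_total - (ss_explained + ss_residual) # Essentially zero: (1) holds here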

Here are my questions:

  1. If I apply a train/test split to a dataset, does/should equation (1) hold for the test dataset? NB: I have already tried this, and it does not hold for the test dataset; I am looking for an explanation.

  2. Are there any cases where equation (1) will not hold for the training dataset?

  3. In some textbooks, $R^2 = \frac{SS_{Explained}}{SS_{Total}}$. When should this hold?

Best Answer

  1. No. That formula comes from ordinary least squares with an intercept on the training data. I go through the math here to explain why that fails to hold in general. An important point is that the OLS coefficients for the training data (probably) are not the OLS coefficients for the out-of-sample data, meaning that the $Other$ term in the link cannot be counted on to be zero. Consider the example below.
set.seed(2023)
N <- 10
x_train <- runif(N)                  # Training data
y_train <- 2*x_train + rnorm(N)
x_test <- runif(N)                   # Test data from the same process
y_test <- 2*x_test + rnorm(N)
L_train <- lm(y_train ~ x_train)     # OLS fit on the training data
L_test <- lm(y_test ~ x_test)        # OLS fit on the test data
summary(L_train)$coef[, 1]           # Training coefficients
summary(L_test)$coef[, 1]            # Test coefficients

The OLS coefficients on the test data, which would make the equality hold there, are $\hat\beta_{intercept} = -2.62802$ and $\hat\beta_{slope} = 5.87788$. However, the coefficients actually being applied to the test data come from the training fit: $\hat\beta_{intercept} = 0.07146299$ and $\hat\beta_{slope} = 2.04047943$.
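
To see the failure directly, here is a continuation of the code above (the preds_test and SS*_test names are mine) that applies the training fit to the test data and checks decomposition (1):
preds_test <- predict(L_train, newdata = data.frame(x_train = x_test)) # Training coefficients, test data
SSTotal_test     <- sum((y_test - mean(y_test))^2)
SSExplained_test <- sum((preds_test - mean(y_test))^2)
SSResidual_test  <- sum((y_test - preds_test)^2)
SSTotal_test - (SSExplained_test + SSResidual_test)  # Nonzero: (1) fails out of sample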

  1. This always holds for ordinary least squares linear regression with an intercept. If you use a nonlinear regression (even one fit by minimizing the sum of squared residuals), a linear regression fit by a method other than minimizing the sum of squared residuals, or a linear regression without an intercept (even when fit by minimizing the sum of squared residuals), you no longer meet the conditions that give the decomposition of the total sum of squares into the classical "explained" and "unexplained" sums of squares. Again, that $Other$ term from the link cannot be counted on to be zero if you deviate from ordinary least squares linear regression with an intercept. Consider the example below, where a linear model $\mathbb{E}[y] = \beta_0 + \beta_1 x$ is fit by minimizing the sum of absolute residuals instead of squared residuals.
library(quantreg)
set.seed(2023)
N <- 10
x <- runif(N)
y <- 2*x + rnorm(N)
Q <- quantreg::rq(y ~ x) # Minimize absolute loss
L <- lm(y ~ x)           # Minimize square loss, as usual
predictions_Q <- predict(Q)
predictions_L <- predict(L)
SSTotal <- sum((y - mean(y))^2)
SSResidual_Q <- sum((y - predictions_Q)^2)
SSExplained_Q <- sum((predictions_Q - mean(y))^2)
SSResidual_L <- sum((y - predictions_L)^2)
SSExplained_L <- sum((predictions_L - mean(y))^2)
SSTotal - (SSResidual_Q + SSExplained_Q) # Differ by about -2
SSTotal - (SSResidual_L + SSExplained_L) # Differ by essentially zero

When we use a linear model with an intercept yet fit the coefficients using an alternative to ordinary least squares, the decomposition fails. However, in the OLS situation, the decomposition holds within what I consider acceptable bounds of doing math on a computer.
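
The no-intercept case mentioned above behaves similarly; here is a quick sketch of my own, reusing x, y, and SSTotal from the block above (the L0 names are mine):
L0 <- lm(y ~ x - 1)                        # OLS, but with the intercept suppressed
predictions_L0 <- predict(L0)
SSResidual_L0  <- sum((y - predictions_L0)^2)
SSExplained_L0 <- sum((predictions_L0 - mean(y))^2)
SSTotal - (SSResidual_L0 + SSExplained_L0) # Nonzero in general: decomposition fails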

  1. This will hold when the author defines $R^2$ this way. However, it will not coincide with other common definitions of $R^2$, such as the $R^2 = 1 - \frac{SS_{Residual}}{SS_{Total}}$ you gave, unless that $Other$ term from my linked answer equals zero.
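
Continuing the example above, the two definitions coincide for the OLS fit but not for the quantile-regression fit (a sketch of my own, reusing the objects already defined):
SSExplained_L / SSTotal        # "explained / total" definition, OLS fit
1 - SSResidual_L / SSTotal     # "1 - residual / total" definition, OLS fit: same value
SSExplained_Q / SSTotal        # "explained / total" definition, rq fit
1 - SSResidual_Q / SSTotal     # "1 - residual / total" definition, rq fit: differs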

Somewhat related is my usual spiel about what $R^2$ should mean. You can calculate whatever you want, but I find one calculation to be most useful. I'll close by quoting another post of mine, which also deals with calculating out-of-sample $R^2$ (though mostly with regards to what the denominator should be).

Finally, definition 4 makes sense. We have some kind of baseline model (naïvely predict $\bar y$ every time, always using the marginal mean as our guess of the conditional mean) and compare our predictions to the predictions made by that baseline model. To draw an analogy to flipping a coin, if someone guesses which side will land up and gets correct predictions less than half the time, that person is a poor predictor. If they are right more than half the time, they are at least improving somewhat upon the naïve “Gee, I don’t know how it’ll land, so I guess I’ll just say heads every time (or tails, or alternate between the two) and get it right about half the time.”
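
In that spirit, here is one way to compute an out-of-sample $R^2$ against that naïve baseline, reusing the train/test objects from the first code block (a sketch only; using the training mean as the baseline prediction is one choice among several, since the quoted post notes the denominator is debatable):
preds_test <- predict(L_train, newdata = data.frame(x_train = x_test))
sse_model    <- sum((y_test - preds_test)^2)      # Model's squared error on the test data
sse_baseline <- sum((y_test - mean(y_train))^2)   # Baseline: always predict the training mean
1 - sse_model / sse_baseline                      # Out-of-sample R^2 relative to that baseline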