Solved – How to calculate out of sample R squared

machine-learning, out-of-sample, r-squared, regression

I know this probably has been discussed somewhere else, but I have not been able to find an explicit answer. I am trying to use the formula $R^2 = 1 - SSR/SST$ to calculate the out-of-sample $R^2$ of a linear regression model, where $SSR$ is the sum of squared residuals and $SST$ is the total sum of squares. For the training set, it is clear that

$$ SST = \sum (y - \bar{y}_{train})^2 $$

What about the testing set? Should I keep using $\bar{y}_{train}$ for out of sample $y$, or use $\bar{y}_{test}$ instead?

I found that if I use $\bar{y}_{test}$, the resulting $R^2$ can be negative sometimes. This is consistent with the description of sklearn's r2_score() function, where they used $\bar{y}_{test}$ (which is also used by their linear_model's score() function for testing samples). They state that "a constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0."
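To make the two options concrete, here is a small sketch of what I mean (the numbers are made up); if I read the docs correctly, sklearn's r2_score corresponds to the $\bar{y}_{test}$ variant:

```python
import numpy as np
from sklearn.metrics import r2_score

# made-up example values, just to illustrate the two choices of centering mean
y_train = np.array([3.0, 1.5, 2.2, 4.1, 2.8])
y_test  = np.array([2.0, 3.5, 1.0, 4.0])
y_pred  = np.array([2.3, 3.0, 1.4, 3.6])   # model predictions on the test set

ssr = np.sum((y_test - y_pred) ** 2)

# Option 1: centre SST on the test-set mean (this matches sklearn's r2_score)
sst_test = np.sum((y_test - y_test.mean()) ** 2)
r2_test_mean = 1 - ssr / sst_test
assert np.isclose(r2_test_mean, r2_score(y_test, y_pred))

# Option 2: centre SST on the training-set mean
sst_train = np.sum((y_test - y_train.mean()) ** 2)
r2_train_mean = 1 - ssr / sst_train

print(r2_test_mean, r2_train_mean)
```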

However, in other places people have used $\bar{y}_{train}$ like here and here (the second answer by dmi3kno). So I was wondering which makes more sense? Any comment will be greatly appreciated!

Best Answer

First of all, it should be said that for prediction evaluation, i.e. out of sample, the usual $R^2$ is not adequate. This is because the usual $R^2$ is computed on residuals, which are in-sample quantities.

We can define: $R^2 = 1 - RSS/TSS$

RSS = residual sum of squares

TSS = total sum of squares

The main problem here is that residuals are not a good proxy for forecast errors, because with residuals the same data are used both to estimate the model and to assess its predictive accuracy. If residuals (RSS) are used, prediction accuracy is overstated and overfitting is likely. Even TSS is not adequate, as we will see below. That said, in the past the mistaken use of the standard $R^2$ for forecast evaluation was quite common.

The out-of-sample $R^2$ ($R_{oos}^2$) keeps the idea of the usual $R^2$, but in place of RSS it uses the out-of-sample MSE of the model under analysis ($MSE_m$), and in place of TSS it uses the out-of-sample MSE of a benchmark model ($MSE_{bmk}$).

$R_{oos}^2 = 1 - MSE_m/MSE_{bmk}$
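As a minimal sketch (the helper name and the data are mine, and the benchmark here is simply a constant forecast equal to the training-sample mean), this is how $R_{oos}^2$ can be computed:

```python
import numpy as np

def r2_oos(y_test, y_pred_model, y_pred_bmk):
    """Out-of-sample R^2: 1 - (MSE of the model / MSE of the benchmark),
    both evaluated on the same held-out data."""
    mse_m   = np.mean((y_test - y_pred_model) ** 2)
    mse_bmk = np.mean((y_test - y_pred_bmk) ** 2)
    return 1 - mse_m / mse_bmk

# example: benchmark = constant forecast equal to the training mean
y_train = np.array([3.0, 1.5, 2.2, 4.1, 2.8])
y_test  = np.array([2.0, 3.5, 1.0, 4.0])
y_pred  = np.array([2.3, 3.0, 1.4, 3.6])          # forecasts from the competing model
y_bmk   = np.full_like(y_test, y_train.mean())    # benchmark forecasts

print(r2_oos(y_test, y_pred, y_bmk))
```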

One notable difference between $R^2$ and $R_{oos}^2$ is that

$0 \leq R^2 \leq 1$ (if the constant term is included)

while $-\infty < R_{oos}^2 \leq 1$

If $R_{oos}^2$ is less than / equal to / greater than $0$, the competing model performs worse than / the same as / better than the benchmark. If $R_{oos}^2 = 1$, the competing model predicts the (new) data perfectly.

Here we have to keep in mind that even for the benchmark model we have to consider its out-of-sample performance. Therefore the variance of the out-of-sample data underestimates $MSE_{bmk}$.

To my knowledge, this measure was first proposed in: Campbell and Thompson (2008), "Predicting excess stock returns out of sample: Can anything beat the historical average?", Review of Financial Studies. There, the benchmark forecast is the prevailing mean, using only the information available at the time of the forecast.
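As a rough sketch of that benchmark (the data and variable names here are made up), the prevailing mean at each date is just an expanding mean of the observations available before it:

```python
import numpy as np

# y: the full time series, ordered in time; out-of-sample forecasts start at index `start`
y = np.array([1.2, 0.8, 1.5, 2.0, 1.1, 1.7, 2.3, 0.9])
start = 4

# prevailing-mean benchmark: at time t, forecast the mean of y[0], ..., y[t-1]
bmk_forecasts = np.array([y[:t].mean() for t in range(start, len(y))])
y_out = y[start:]

mse_bmk = np.mean((y_out - bmk_forecasts) ** 2)
print(mse_bmk)
```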
