Regression – Should R-Squared Be Calculated on Training Data or Test Data?

machine learning, r-squared, regression

When calculating the $R^2$ value of a linear regression model, should it be calculated on the training dataset, test dataset or both and why?

Furthermore, when calculating $SS_{\text{res}}$ and $SS_{\text{tot}}$ as per the Wikipedia article above, should both sums be over the same dataset? In other words, if $SS_{\text{res}}$ is calculated over the training dataset, does $SS_{\text{tot}}$ also have to be calculated over the training dataset (and similarly for the test dataset)?

Best Answer

The test data shows you how well your model has generalized. When you run the test data through your model, it is the moment you've been waiting for: is it good enough?

In machine learning, it is very common to report the training, validation and test metrics together, but the test metric is the most important one, because it is computed on data the model did not see during fitting.

However, if you get a low $R^2$ score on one and not on the other, then something is off. For example, if $R^2_{\text{test}}\ll R^2_{\text{training}}$, your model does not generalize well. One possible cause is that the test set contains data points from a region the training set did not cover, so the model is being asked to extrapolate and does so poorly (a form of covariate shift).
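To make the comparison concrete, here is a minimal sketch in plain Python. It assumes synthetic data, a one-variable least-squares fit, and a simple 150/50 split; the helper names `fit_line` and `r_squared` are made up for illustration. It uses one common convention, in which $SS_{\text{res}}$ and $SS_{\text{tot}}$ are both summed over the same dataset that is being scored (this is also what scikit-learn's `r2_score` does, since it takes the mean from the `y_true` it is given). Other conventions exist, so treat this as an illustration rather than the definitive definition.

```python
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """R^2 = 1 - SS_res / SS_tot, with both sums over the same (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic data (an assumption for this sketch): y = 2 + 3x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 2.0) for x in xs]

# Hold out the last 50 points as the test set.
x_train, y_train = xs[:150], ys[:150]
x_test, y_test = xs[150:], ys[150:]

# Fit on the training data only, then score each set separately.
intercept, slope = fit_line(x_train, y_train)
r2_train = r_squared(x_train, y_train, intercept, slope)
r2_test = r_squared(x_test, y_test, intercept, slope)

print(f"R^2 (training): {r2_train:.3f}")
print(f"R^2 (test):     {r2_test:.3f}")
```

Because the training and test points here come from the same distribution, the two scores should be close. A large gap between them would be the warning sign described above.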

In conclusion: you should compare them! However, in many cases, it's the test-set results you're most interested in.
