Solved – Difference between Adjusted R Squared and Predicted R Squared

multiple-regression, r-squared, regression

As we all know, relying solely on R squared to judge goodness of fit is misleading, because the R squared value never decreases (and typically increases) when more predictors are added, regardless of whether those predictors have any real relationship with the response. With this in mind, many people recommend using the adjusted R squared value instead, especially if the equation will be used to estimate future values.

However, I recently stumbled upon something called the predicted R squared value, which relies on the predicted residual error sum of squares (PRESS) statistic. In this method,

  1. A data point is removed from your dataset.
  2. The regression model is refit on the remaining data points.
  3. The removed data point is plugged into the refitted model to generate a predicted value.
  4. The removed data point is placed back into the dataset. Repeat from step 1 for the next data point until every data point has been removed once.

Afterward, you have a vector of predicted values; subtracting each from its corresponding true value gives the predicted residuals, and squaring and summing those residuals gives the PRESS value. When calculating the predicted R squared, PRESS effectively replaces the residual sum of squares in the R squared formula, so predicted R squared = 1 - PRESS / SS_total, where SS_total is the total sum of squares.
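Here is a rough sketch of what I mean in Python/NumPy (the function name and example data are just illustrative, not from any particular package):

```python
import numpy as np

def predicted_r_squared(X, y):
    """Compute PRESS and predicted R^2 via an explicit leave-one-out loop.

    X : (n, p) array of predictor values (no intercept column)
    y : (n,) array of response values
    """
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])      # design matrix with intercept
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i               # drop observation i
        beta, *_ = np.linalg.lstsq(X1[keep], y[keep], rcond=None)
        y_pred_i = X1[i] @ beta                # predict the held-out point
        press += (y[i] - y_pred_i) ** 2        # accumulate squared predicted residual
    ss_total = np.sum((y - y.mean()) ** 2)
    return press, 1.0 - press / ss_total       # predicted R^2 = 1 - PRESS / SS_total

# Made-up example data, just to show usage:
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=30)
press, pred_r2 = predicted_r_squared(X, y)
print(press, pred_r2)
```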

My question is: which method better accounts for overfitting, adjusted R squared or predicted R squared? The adjusted R squared formula only adjusts the ordinary R squared value using the sample size and the number of predictors, whereas the predicted R squared completely recalculates the residual sum of squares.
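For reference, the adjustment I mean is the usual one based only on R squared, the sample size n, and the number of predictors p; a one-line sketch in Python (the function name is just illustrative):

```python
def adjusted_r_squared(r2, n, p):
    """Standard adjusted R^2: penalizes R^2 using sample size n and predictor count p."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
```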

Part of a project I'm working on requires building a multiple linear regression model to relate predictors (fictional example) such as grass length, intensity of green grass color, quantity of grass per unit area, etc. (x variables) to cow happiness (y variable). The adjusted R squared and predicted R squared values behave completely differently as I go from 3 to 4 to 5 predictors: the adjusted R squared stays roughly constant around 91%, while the predicted R squared drops from 87% to 71% to 60%. Which R squared value should I trust here?

Any help is appreciated. Thanks!

Best Answer

I won't attempt to give you a highly technical answer here.

I am inclined to trust the PRESS statistic more than the adjusted R squared. The adjusted R squared is still an "in-sample" measure, while PRESS is an "out-of-sample" measure, and an out-of-sample measure is generally a better guide to how the model will predict new data.

Additionally, the adjusted R squared might not be very different from R squared if you have a small number of predictors compared to the number of observations you are fitting. So it might not be as informative as the PRESS statistic.
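To give a rough sense of scale (with made-up numbers): with, say, 100 observations and 5 predictors, an R squared of 0.91 only falls to an adjusted R squared of about 0.905, so the adjustment barely registers.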

However, if what you are describing is correct and the PRESS statistic is calculated correctly, then it is clear that the predictive performance of your model suffers as you add predictors. What is less clear is why the predicted R squared deteriorates so sharply while the adjusted R squared barely moves; that large a gap is unusual.

I would recommend checking that you are calculating the PRESS statistic correctly (if this is not something that is provided by the tool you are using to fit the model).
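If it helps, one standard way to cross-check an explicit leave-one-out loop is the algebraic shortcut PRESS = sum over i of (e_i / (1 - h_ii))^2, where e_i are the ordinary residuals and h_ii the hat-matrix diagonals from the single full-data fit. A rough NumPy sketch (names are illustrative):

```python
import numpy as np

def press_from_leverages(X, y):
    """PRESS without refitting: sum((e_i / (1 - h_ii))^2), using ordinary
    residuals e_i and hat-matrix diagonals h_ii from one full-data fit."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])          # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta                      # ordinary residuals
    hat = X1 @ np.linalg.solve(X1.T @ X1, X1.T)    # hat matrix H = X (X'X)^-1 X'
    leverages = np.diag(hat)                       # h_ii
    return np.sum((residuals / (1.0 - leverages)) ** 2)
```

If the explicit loop and this shortcut disagree, the loop implementation is the likely culprit.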

It is quite difficult to diagnose what is happening without having more detail of the issue at hand.