Solved – Is Predicted R-squared a Valid Method for Rejecting Additional Explanatory Variables in a Model

factor analysis, r-squared, regression, time series

I'm building a model to understand the important drivers from a set of possible drivers for a time series of data. In my case the possible drivers are other time series.

As with most statistical models, I can always add more drivers and improve the quality of my fit (measured by variance explained). I'm currently using forward selection, requiring that the variance explained improve by at least a certain percentage before another driver is added (or before adding any drivers at all). That percentage feels arbitrary, and depending on its value I may overfit.
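For concreteness, here is a minimal sketch of this kind of selection rule (the function name, the default 5% threshold, and the y/drivers inputs are placeholders, not my actual data):

forward_r2 <- function(y, drivers, min_gain = 0.05) {
  # Forward selection by R^2 improvement: at each step add the candidate
  # driver that raises R^2 the most, and stop once the gain falls below
  # an (arbitrary) threshold.
  selected <- character(0)
  best_r2 <- 0
  repeat {
    remaining <- setdiff(names(drivers), selected)
    if (length(remaining) == 0) break
    r2 <- sapply(remaining, function(v) {
      f <- reformulate(c(selected, v), response = "y")
      summary(lm(f, data = cbind(y = y, drivers)))$r.squared
    })
    if (max(r2) - best_r2 < min_gain) break   # the stopping rule in question
    selected <- c(selected, names(which.max(r2)))
    best_r2 <- max(r2)
  }
  selected
}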

I was wondering whether improvement in predicted R^2 (definition from minitab.com below) would be a more consistent and better-performing criterion for deciding when to stop adding drivers.

Predicted R2 is calculated by systematically removing each observation from the data set, estimating the regression equation, and determining how well the model predicts the removed observation.

Best Answer

Predicted R-squared is essentially a leave-one-out cross-validation estimate, so it would be no different from many other cross-validation estimates of prediction error (e.g., CV-MSE).
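For an ordinary linear model it can be computed from a single fit, using the standard leverage shortcut for leave-one-out residuals; a minimal sketch (y and x here stand in for your data):

# Predicted R^2 via the PRESS statistic: for least squares, the leave-one-out
# residual for observation i equals residual_i / (1 - leverage_i).
fit <- lm(y ~ x)
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
pred_r2 <- 1 - press / sum((y - mean(y))^2)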

That said, R^2 isn't a great measure for this, since R^2 never decreases (and in practice almost always increases) as you add variables, regardless of whether the added variable is meaningful. For example:

> x <- rnorm(100)
> y <- 1 * x + rnorm(100, 0, 0.25)
> z <- rnorm(100)
> summary(lm(y ~ x))$r.squared
[1] 0.9224326
> summary(lm(y ~ x + z))$r.squared
[1] 0.9273826 

That is why R^2 is not a good measure of model quality. Information-based measures, like AIC and BIC, are better.
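For instance, continuing the simulation above, AIC will usually (though not always) favor the model that omits the noise variable z:

# AIC pays a penalty for the extra parameter, so the model including the
# irrelevant z will typically have the higher (worse) AIC.
AIC(lm(y ~ x))
AIC(lm(y ~ x + z))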

This is especially true in a time series application, where you expect your error terms to be autocorrelated. You should probably be looking at a time series model (ARIMA would be a good place to start) with exogenous regressors to account for the autocorrelation. As it stands, your model is likely substantially overstating the variance explained and inflating your R^2.

I'd strongly encourage you to look at time series modeling and AIC based measures of model fit.

EDIT: I wrote a little simulation to compute PRESS and the predicted R^2 for some simulated data and compare them against AIC.

sim <- function() {
  # Simulate one relevant driver (x) and one irrelevant driver (z)
  x <- rnorm(100)
  y <- 1 * x + rnorm(100, 0, .25)
  z <- rnorm(100)

  # Leave-one-out PRESS for the model with x only and the model with x + z
  press1 <- press2 <- rep(NA, 100)
  for (i in 1:100) {
    yt <- y[i]
    x2 <- x[-i]
    y2 <- y[-i]
    z2 <- z[-i]
    b1 <- coef(lm(y2 ~ x2))
    b2 <- coef(lm(y2 ~ x2 + z2))
    press1[i] <- (yt - b1 %*% c(1, x[i]))^2
    press2[i] <- (yt - b2 %*% c(1, x[i], z[i]))^2
  }

  # Predicted R^2 for each model
  sst <- sum((y - mean(y))^2)
  p1 <- 1 - sum(press1) / sst
  p2 <- 1 - sum(press2) / sst

  # AIC for each model
  a1 <- AIC(lm(y ~ x))
  a2 <- AIC(lm(y ~ x + z))

  # Does each criterion prefer the (true) model without z?
  c(p1 >= p2, a1 <= a2)
}

sim()

x <- replicate(100, sim())
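The share of runs in which each criterion picks the simpler model can then be read off with, for example:

rowMeans(x)  # fraction of runs in which each criterion preferred the model without z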

Both methods preferred the better model about 85% of the time. AIC has the benefit of a stronger theoretical basis and generalizes better to other models (e.g., GLMs, where R^2 is not defined).

The bigger issue here is using a linear model on something with likely autocorrelated errors (a time series).

Using a dataset (Seatbelts in R) to estimate the effect of a seatbelt law: with just a linear model, adjusting for gas price and distance driven, the law's effect is estimated as -11.89 with a standard error of 6.026.

If I account for the fact that the data are correlated with themselves and estimate the law's effect in the context of an ARIMA model, I estimate the effect as -20 with a standard error of 7.9.
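The two fits look roughly like this (the response column and the ARIMA order here are assumptions for illustration, so the coefficients from this sketch need not match the numbers above exactly):

# Plain linear model, ignoring the autocorrelation
fit_lm <- lm(DriversKilled ~ law + PetrolPrice + kms, data = as.data.frame(Seatbelts))
summary(fit_lm)$coefficients["law", ]

# ARIMA with the same variables as exogenous regressors
fit_arima <- arima(Seatbelts[, "DriversKilled"], order = c(1, 0, 0),
                   seasonal = c(1, 0, 0),
                   xreg = Seatbelts[, c("law", "PetrolPrice", "kms")])
fit_arima$coef["law"]                   # estimated law effect
sqrt(diag(fit_arima$var.coef))["law"]   # its standard error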

Because the linear model ignored the time series structure, the estimate was off by roughly a factor of two and the standard error of the main variable of interest was underestimated. The same thing (but worse) happens with the gas price and distance variables.