Solved – Is Predicted R-squared a Valid Method for Rejecting Additional Explanatory Variables in a Model

factor analysis, r-squared, regression, time series

I'm building a model to understand the important drivers from a set of possible drivers for a time series of data. In my case the possible drivers are other time series.

As with most statistical models, I can always add drivers and improve the quality of my fit (measured by variance explained). In this case I'm using forward selection, and I require the variance explained to improve by at least a certain percentage before I add another driver. That threshold feels arbitrary, and depending on its value I may overfit.

I was wondering if improvement in Predicted R^2 (definition from minitab.com below) would be a more consistent and better performing method for understanding when to stop adding additional drivers?

Predicted R2 is calculated by systematically removing each observation from the data set, estimating the regression equation, and determining how well the model predicts the removed observation.

Best Answer

Predicted R squared would be no different than many other forms of cross-validation estimates of error (e.g., CV-MSE).
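As a rough illustration of that equivalence, here is a minimal sketch on simulated data (the seed, sample size, and variable names are assumptions, not part of the question). It computes PRESS and predicted R^2 once with the standard hat-value shortcut for linear models and once by refitting with each observation left out, and the two agree.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 100
x <- rnorm(n)
y <- x + rnorm(n, 0, 0.25)
fit <- lm(y ~ x)

# Leave-one-out residuals via the hat-value shortcut: e_i / (1 - h_ii)
loo_resid <- residuals(fit) / (1 - hatvalues(fit))
press <- sum(loo_resid^2)
sst <- sum((y - mean(y))^2)
pred_r2 <- 1 - press / sst

# The same PRESS by brute force: refit n times, each time dropping observation i
press_loop <- sum(sapply(seq_len(n), function(i) {
  b <- coef(lm(y[-i] ~ x[-i]))
  (y[i] - (b[1] + b[2] * x[i]))^2
}))

c(predicted_r2 = pred_r2, r2 = summary(fit)$r.squared)
all.equal(press, press_loop)
```

Because each leave-one-out residual is at least as large in absolute value as the ordinary residual, predicted R^2 cannot exceed the ordinary R^2.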

That said, ordinary (in-sample) R^2 isn't a great measure, since it never decreases when a variable is added, regardless of whether that variable is meaningful. For example:

> x <- rnorm(100)
> y <- 1 * x + rnorm(100, 0, 0.25)
> z <- rnorm(100)
> summary(lm(y ~ x))$r.squared
[1] 0.9224326
> summary(lm(y ~ x + z))$r.squared
[1] 0.9273826

R^2 doesn't make a good measure of model quality because of that. Information based measures, like AIC and BIC, are better.
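A small sketch of what such a comparison might look like, again on simulated data where z is pure noise (the seed and object names are assumptions). It tabulates R^2, adjusted R^2, AIC, and BIC for the two models; lower AIC and BIC are better.

```r
set.seed(42)  # arbitrary seed
n <- 100
x <- rnorm(n)
y <- x + rnorm(n, 0, 0.25)
z <- rnorm(n)  # irrelevant variable

m1 <- lm(y ~ x)
m2 <- lm(y ~ x + z)

comparison <- data.frame(
  model  = c("y ~ x", "y ~ x + z"),
  r2     = c(summary(m1)$r.squared, summary(m2)$r.squared),
  adj_r2 = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  aic    = c(AIC(m1), AIC(m2)),
  bic    = c(BIC(m1), BIC(m2))
)
comparison
```

R^2 in the second row is never lower than in the first, whereas AIC and BIC add a penalty per parameter, so the larger model has to earn its extra term.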

This is especially true in a time series application, where you expect your error terms to be autocorrelated. You should probably be looking at a time series model with exogenous regressors (ARIMA would be a good place to start) to account for the autocorrelation. As is, your model is likely massively overstating the explained variance and inflating your R^2.

I'd strongly encourage you to look at time series modeling and AIC based measures of model fit.
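A minimal sketch of that approach, assuming simulated data with AR(1) errors and an AR(1) order chosen by hand (in practice the order would need to be selected for the data at hand). It uses `arima()` from base R's stats package with the driver passed through `xreg`.

```r
set.seed(7)  # arbitrary seed
n <- 200
x <- as.numeric(arima.sim(list(ar = 0.7), n = n))  # autocorrelated driver
e <- as.numeric(arima.sim(list(ar = 0.8), n = n))  # AR(1) errors
y <- 2 + 0.5 * x + e

# Ordinary regression, which ignores the error structure
fit_lm <- lm(y ~ x)

# Regression with AR(1) errors; the driver enters as an exogenous regressor
fit_ar <- arima(y, order = c(1, 0, 0), xreg = cbind(x = x))

# Ljung-Box test on the OLS residuals: a small p-value signals autocorrelation
Box.test(residuals(fit_lm), lag = 10, type = "Ljung-Box")

# Standard error of the driver's coefficient under each model
c(lm = summary(fit_lm)$coefficients["x", "Std. Error"],
  arima = unname(sqrt(diag(fit_ar$var.coef))["x"]))

# AIC of each fit to the same response
c(lm = AIC(fit_lm), arima = AIC(fit_ar))
```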

EDIT: I wrote a little simulation to compute PRESS and the predicted R^2 for some simulated data and compared it against AIC.

sim <- function(n = 100) {
  # True model is y ~ x; z is an irrelevant noise variable
  x <- rnorm(n)
  y <- 1 * x + rnorm(n, 0, .25)
  z <- rnorm(n)

  # Leave-one-out squared prediction errors for each model
  press1 <- press2 <- rep(NA, n)
  for (i in 1:n) {
    b1 <- coef(lm(y[-i] ~ x[-i]))
    b2 <- coef(lm(y[-i] ~ x[-i] + z[-i]))
    press1[i] <- (y[i] - b1 %*% c(1, x[i]))^2
    press2[i] <- (y[i] - b2 %*% c(1, x[i], z[i]))^2
  }

  # Predicted R^2 = 1 - PRESS / SST
  sst <- sum((y - mean(y))^2)
  p1 <- 1 - sum(press1)/sst
  p2 <- 1 - sum(press2)/sst

  # AIC of each model fitted to the full sample
  a1 <- AIC(lm(y ~ x))
  a2 <- AIC(lm(y ~ x + z))

  # TRUE when a criterion prefers the smaller (correct) model
  c(pred_r2 = p1 >= p2, aic = a1 <= a2)
}

sim()

res <- replicate(100, sim())
rowMeans(res)  # share of runs in which each criterion preferred the correct model

Both methods preferred the better model about 85% of the time. AIC has the benefit of a stronger theoretical basis, and it generalizes better to other models (e.g., GLMs, where R^2 is not defined).

The bigger issue here is using a linear model on something with likely autocorrelated errors (a time series).

Using a dataset (Seatbelts in R) to estimate the effect of a seatbelt law, when I use just a linear model and adjust for gas price and distance driven the law's effect is estimated as -11.89 with a standard error of 6.026.

If I account for the fact that the data is correlated with itself and estimate the law effect in the context of an ARIMA model, I estimate the law's effect as -20 and with a standard error of 7.9.

Because the linear model ignored the time series properties, its estimate of the law's effect was off by nearly a factor of two (-11.89 versus -20), and the standard error of the main variable of interest was underestimated (6.0 versus 7.9). The same thing (but worse) happens with the gas price and distance variables.
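The exact models behind those numbers are not given, so the following is only a sketch of the kind of comparison described. It borrows the specification from R's own help page for this dataset (log10 of drivers, rescaled kms, AR(1) plus a seasonal AR(1) term), which is an assumption and will not reproduce the figures quoted above.

```r
sb <- as.data.frame(Seatbelts)   # built-in monthly UK road-casualty data
sb$log_kms <- log10(sb$kms) - 4  # rescaled distance, as in ?UKDriverDeaths

# Plain linear model: ignores the serial correlation
fit_lm <- lm(log10(drivers) ~ log_kms + PetrolPrice + law, data = sb)

# Same regressors inside an ARIMA model (AR(1) plus seasonal AR(1) errors)
X <- as.matrix(sb[, c("log_kms", "PetrolPrice", "law")])
fit_ar <- arima(log10(Seatbelts[, "drivers"]), order = c(1, 0, 0),
                seasonal = list(order = c(1, 0, 0)), xreg = X)

# Estimate and standard error of the law effect under each model
law_lm <- summary(fit_lm)$coefficients["law", c("Estimate", "Std. Error")]
law_ar <- c(coef(fit_ar)["law"], sqrt(diag(fit_ar$var.coef))["law"])
rbind(lm = law_lm, arima = law_ar)
```

The point of the comparison is the pair of columns in the last table: how far the estimate and its standard error move once the autocorrelation is modeled.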

Summary

You appear to be looking at the associations between symptoms (a, b, c, d, and e, coded as linear, numeric variables) and cancer status (yes versus no, coded in binary).

Associations versus predictions

I think you are looking at associations between the symptoms and cancer status rather than the ability of the symptoms to predict cancer status. If you really wanted to investigate predictive ability, you would need to divide your data set in half, fit models to one half of the data, and then use them to predict the cancer status of the patients in the other half. That is the simplest way to validate a model with a single data set, but it wastes data, so I wouldn't recommend it. A better option is n-fold cross-validation (for example, using the rms package in R), which makes more efficient use of your data.
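To make the idea concrete, here is a minimal n-fold (here 5-fold) cross-validation sketch in base R rather than rms. The data are simulated stand-ins for two symptom scores and a binary outcome, and every name and parameter value is an assumption chosen for illustration.

```r
set.seed(123)  # arbitrary seed
n <- 400
dat <- data.frame(a = rnorm(n, 25, 7), b = rnorm(n, 31, 6))
dat$cancer <- rbinom(n, 1, plogis(-1 + 0.08 * dat$a))  # outcome depends on a only

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row
pred <- numeric(n)

for (f in seq_len(k)) {
  train    <- dat[fold != f, ]
  held_out <- dat[fold == f, ]
  fit <- glm(cancer ~ a + b, family = binomial, data = train)
  # Predicted probabilities for rows the model has not seen
  pred[fold == f] <- predict(fit, newdata = held_out, type = "response")
}

# Brier score (mean squared error of the probabilities); lower is better
brier      <- mean((pred - dat$cancer)^2)
brier_null <- mean((mean(dat$cancer) - dat$cancer)^2)  # always predict the base rate
c(cv_brier = brier, null_brier = brier_null)
```

Every observation is used for fitting in four folds and for testing in one, which is what makes this more efficient than a single half-and-half split.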

Starting off

You may have already done this, but prior to playing around with logistic regression modeling I think you should take a step back and just look at your data. Using the program R to compute a few basic summary statistics...

# Load libraries
library(Rmisc)
library(metafor)

# Load data (no attach() needed; columns are accessed through the data frame)
data <- read.csv("example_data.csv", header = TRUE, na.strings = "")

# Summarize data
summary(data)
       a              b               c               d               e             cancer      
 Min.   :11.0   Min.   :13.00   Min.   :13.00   Min.   :12.00   Min.   :17.00   Min.   :0.0000  
 1st Qu.:19.0   1st Qu.:27.00   1st Qu.:28.00   1st Qu.:36.00   1st Qu.:33.00   1st Qu.:1.0000  
 Median :24.0   Median :31.00   Median :32.00   Median :40.00   Median :38.00   Median :1.0000  
 Mean   :24.8   Mean   :31.39   Mean   :32.44   Mean   :39.39   Mean   :37.71   Mean   :0.9169  
 3rd Qu.:30.0   3rd Qu.:36.00   3rd Qu.:37.00   3rd Qu.:43.50   3rd Qu.:42.00   3rd Qu.:1.0000  
 Max.   :49.0   Max.   :50.00   Max.   :50.00   Max.   :50.00   Max.   :50.00   Max.   :1.0000  
 NA's   :20     NA's   :18      NA's   :21      NA's   :20      NA's   :20      NA's   :6

And now to plot some exploratory scatter plots... Pay attention to any linear relationships between variables that pop out to your eye. Also pay attention (as Benjamin mentioned below) to the plots of the symptom variables versus cancer status.

plot(data)

[Figure: scatter plots]

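And look at some histograms to get a sense of the distribution of your data. It is always good to do this before plugging variables into a regression model. Note that base R's hist() does not accept a whole data frame, so plot one column at a time:

par(mfrow = c(2, 3))
for (v in names(data)) hist(data[[v]], main = v, xlab = v)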

[Figure: histograms]

Going a bit further...

I would compute the mean and 95% CI for each symptom variable, stratified by cancer status, and plot them. Just by looking at such a plot you get a good visual sense of which variables are likely to be significant in your logistic regression model. Here I just plot the data...

forest(
x = c(24.44636,28.94667,31.63066,28.62963,32.59910,30.65852,39.79738,35.04111,37.99030,34.41185),
ci.lb = c(23.57979,25.72939,30.84611,26.15883,31.88579,28.52778,39.16493,32.27390,37.26171,32.10734),
ci.ub = c(25.31292,32.16395,32.41520,31.10043,33.31242,32.78926,40.42983,37.80832,38.71888,36.71637),
xlab = "Mean and 95% CI", slab = c("a cancer","a healthy","b cancer","b healthy","c cancer","c healthy","d cancer","d healthy","e cancer","e healthy"))

[Figure: forest plot]

Looking at the plot above, the much wider confidence intervals for the healthy groups give you a visual sense of the fact that far more cancer patients than non-cancer patients contribute to the data set.
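The means and confidence limits passed to forest() were typed in by hand. One way they might be computed is sketched below with a t-based interval in base R. The data are simulated (one symptom score, with a roughly 10:1 imbalance between groups), so the group sizes, means, and names are assumptions rather than the real data.

```r
set.seed(11)  # arbitrary seed
dat <- data.frame(
  a      = c(rnorm(300, 24.5, 7), rnorm(30, 29, 8)),
  cancer = rep(c(1, 0), times = c(300, 30))
)

# Mean and t-based 95% confidence interval, ignoring missing values
mean_ci <- function(v) {
  v <- v[!is.na(v)]
  half <- qt(0.975, df = length(v) - 1) * sd(v) / sqrt(length(v))
  c(mean = mean(v), lower = mean(v) - half, upper = mean(v) + half, n = length(v))
}

# One row per cancer status (0 = healthy, 1 = cancer)
by_status <- t(sapply(split(dat$a, dat$cancer), mean_ci))
by_status
```

The `mean`, `lower`, and `upper` columns are the values that would go into the `x`, `ci.lb`, and `ci.ub` arguments of the forest() call.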

Last...

I would just compute univariate effect estimates for the association of each symptom variable with cancer outcome. Then I would multiply all of the resulting p-values by five (a Bonferroni correction), since you are doing that many exploratory tests. You can do that in SPSS easily. For the results of the models, I would focus more on the direction, magnitude, and confidence intervals of the effect estimates. Below I have plotted the effect estimates and their confidence intervals from univariate models of each separate symptom variable. Now you should go build models that are adjusted for age, gender, smoking, etc. and make another plot like this. I do agree with Benjamin that there is probably not a whole lot you can learn from these data, given the paucity of healthy controls.

[Figure: logistic regression results]
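For readers working in R rather than SPSS, the univariate models and the Bonferroni adjustment could be sketched roughly as follows. The data are simulated, the true effects are invented, and the intervals are simple Wald intervals, so treat all names and numbers as assumptions.

```r
set.seed(5)  # arbitrary seed
n <- 500
dat <- data.frame(a = rnorm(n, 25, 7), b = rnorm(n, 31, 6), c = rnorm(n, 32, 6),
                  d = rnorm(n, 39, 6), e = rnorm(n, 38, 6))
# Invented truth: only a and d are related to the outcome
dat$cancer <- rbinom(n, 1, plogis(1.5 - 0.06 * (dat$a - 25) + 0.05 * (dat$d - 39)))

symptoms <- c("a", "b", "c", "d", "e")

# One logistic regression per symptom: odds ratio, Wald 95% CI, raw p-value
res <- t(sapply(symptoms, function(v) {
  fit <- glm(reformulate(v, response = "cancer"), family = binomial, data = dat)
  est <- coef(summary(fit))[v, ]
  c(odds_ratio = exp(est[["Estimate"]]),
    lower = exp(est[["Estimate"]] - 1.96 * est[["Std. Error"]]),
    upper = exp(est[["Estimate"]] + 1.96 * est[["Std. Error"]]),
    p = est[["Pr(>|z|)"]])
}))

# Bonferroni: multiply each p-value by the number of tests, capped at 1
res <- cbind(res, p_bonferroni = p.adjust(res[, "p"], method = "bonferroni"))
res
```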

