Solved – How to get a confidence interval on a population r-square change

confidence-interval, estimation, r-squared, regression, regularization

For the sake of a simple example, assume that there are two linear regression models:

  • Model 1 has three predictors: x1a, x1b, and x1c
  • Model 2 has the three predictors from Model 1 plus two additional predictors, x2a and x2b

There is a population regression equation where the population variance explained is $\rho^2_{(1)}$ for Model 1 and $\rho^2_{(2)}$ for Model 2. The incremental variance explained by Model 2 in the population is $\Delta\rho^2 = \rho^2_{(2)} - \rho^2_{(1)}$.

I am interested in obtaining standard errors and confidence intervals for an estimator of $\Delta\rho^2$. While the example involves models with 3 and 5 predictors, my research interest concerns a wide range of different numbers of predictors (e.g., 5 and 30). My first thought was to use $\Delta r^2_{adj} = r^2_{adj(2)} - r^2_{adj(1)}$ as an estimator and bootstrap it, but I wasn't sure whether this would be appropriate.

Questions

  • Is $\Delta r^2_{adj}$ a reasonable estimator of $\Delta \rho^2$?
  • How can a confidence interval be obtained for the population r-square change (i.e., $\Delta\rho^2$)?
  • Would bootstrapping $\Delta r^2_{adj}$ be appropriate for calculating such a confidence interval?

Any references to simulations or the published literature would also be most welcome.

Example code

If it helps, I created a little simulated dataset in R that could be used to demonstrate an answer:

n <- 100
x <- data.frame(matrix(rnorm(n * 5), ncol = 5))
names(x) <- c('x1a', 'x1b', 'x1c', 'x2a', 'x2b')
beta <- c(1, 2, 3, 1, 2)
model2_rho_square <- .7
error_rho_square <- 1 - model2_rho_square

# error sd chosen so that Model 2 explains model2_rho_square of the variance
error_sd <- sqrt(error_rho_square / model2_rho_square * sum(beta^2))

# population variance explained by Model 1 (x1a, x1b, x1c only)
model1_rho_square <- sum(beta[1:3]^2) / (sum(beta^2) + error_sd^2)
delta_rho_square <- model2_rho_square - model1_rho_square

x$y <- rnorm(n, beta[1] * x$x1a + beta[2] * x$x1b + beta[3] * x$x1c +
               beta[4] * x$x2a + beta[5] * x$x2b, error_sd)

c(delta_rho_square, model1_rho_square, model2_rho_square)

# sample estimate: difference in adjusted r-squares (full minus reduced)
summary(lm(y ~ ., data = x))$adj.r.squared -
        summary(lm(y ~ x1a + x1b + x1c, data = x))$adj.r.squared
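
To make the bootstrap idea concrete, here is a minimal sketch of what I had in mind, using the boot package on the simulated data above (the statistic name and the choice of R = 2000 replicates are mine):

library(boot)

# statistic: difference in adjusted r-squares between full and reduced models
delta_adj_r2 <- function(data, indices) {
    d <- data[indices, ]
    summary(lm(y ~ ., data = d))$adj.r.squared -
        summary(lm(y ~ x1a + x1b + x1c, data = d))$adj.r.squared
}

boot_out <- boot(x, delta_adj_r2, R = 2000)
mean(boot_out$t)                  # mean of the bootstrap replicates
boot.ci(boot_out, type = "perc")  # percentile 95% CI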

Reason for concern with bootstrap

I ran a bootstrap on some data with around 300 cases, 5 predictors in the simple model, and 30 predictors in the full model. While the sample estimate using the adjusted r-square difference was 0.116, the bootstrapped confidence interval was mostly larger, 95% CI (0.095, 0.214), and the mean of the bootstrap replicates was nowhere near the sample estimate. Rather, the mean of the bootstrap replicates appeared to be centred on the sample difference between the unadjusted r-squares. This is despite the fact that I was using the sample adjusted r-squares to estimate the difference.

Interestingly, I tried an alternative way of estimating $\Delta\rho^2$ (sketched in code below):

  1. calculate the sample r-square change
  2. adjust that change using the standard adjusted r-square formula

When applied to the sample data, this reduced the estimate of $\Delta\rho^2$ to .082, but the bootstrapped confidence interval still seemed centred on the first method's estimate: 95% CI (.062, .179), with a bootstrap mean of .118.
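
For concreteness, here is a minimal sketch of this two-step approach applied to the simulated data above. Reading step 2 as plugging the raw r-square change into the standard adjustment formula, with p set to the number of predictors in the full model, is my own interpretation; other choices of p could be defended:

# step 1: raw r-square change between full and reduced models
r2_full    <- summary(lm(y ~ ., data = x))$r.squared
r2_reduced <- summary(lm(y ~ x1a + x1b + x1c, data = x))$r.squared
delta_r2   <- r2_full - r2_reduced

# step 2: apply the standard adjusted r-square formula to the change
# (p = 5, the number of predictors in the full model, is an assumption)
n <- nrow(x)
p <- 5
1 - (1 - delta_r2) * (n - 1) / (n - p - 1)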

Broadly, I'm concerned that bootstrapping treats the sample as the population, and therefore estimators that adjust for overfitting may not behave appropriately under resampling.

Best Answer

Population $R^2$

First, I am trying to understand the definition of the population R-squared.

Quoting your comment:

Or you could define it asymptotically as the proportion of variance explained in your sample as your sample size approaches infinity.

I think you mean this is the limit of the sample $R^2$ when one replicates the model infinitely many times (with the same predictors at each replicate).

So what is the formula for the asymptotic value of the sample $R^2$? Write your linear model $\boxed{Y=\mu+\sigma G}$ as in https://stats.stackexchange.com/a/58133/8402, and use the same notations as this link.
Then one can check that the sample $R^2$ goes to $\boxed{popR^2:=\dfrac{\lambda}{n+\lambda}}$, where $\lambda={\Vert P_{Z} \mu\Vert}^2/\sigma^2$ is the noncentrality parameter, when one replicates the model $Y=\mu+\sigma G$ infinitely many times.

As example:

> ## design of the simple regression model lm(y~x0)
> n0 <- 10
> sigma <- 1
> x0 <- rnorm(n0, 1:n0, sigma)
> a <- 1; b <- 2 # intercept and slope
> params <- c(a,b)
> X <- model.matrix(~x0)
> Mu <- (X%*%params)[,1]
> 
> ## replicate this experiment k times 
> k <- 200
> y <- rep(Mu,k) + rnorm(k*n0)
> # the R-squared is:
> summary(lm(y~rep(x0,k)))$r.squared 
[1] 0.971057
> 
> # theoretical asymptotic R-squared:
> lambda0 <- crossprod(Mu-mean(Mu))/sigma^2
> lambda0/(lambda0+n0)
          [,1]
[1,] 0.9722689
> 
> # another approximation of the asymptotic R-squared for simple linear regression:
> 1-sigma^2/var(y)
[1] 0.9721834

Population $R^2$ of a submodel

Now assume the model is $\boxed{Y=\mu+\sigma G}$ with $H_1\colon\mu \in W_1$ and consider the submodel $H_0\colon \mu \in W_0$.

Then I said above that the population $R^2$ of model $H_1$ is $\boxed{popR^2_1:=\dfrac{\lambda_1}{n+\lambda_1}}$ where $\boxed{\lambda_1=\frac{{\Vert P_{Z_1} \mu\Vert}^2}{\sigma^2}}$ and $Z_1=[1]^\perp \cap W_1$ and then one simply has ${\Vert P_{Z_1} \mu\Vert}^2=\sum(\mu_i - \bar \mu)^2$.

Now, do you define the population $R^2$ of the submodel $H_0$ as the asymptotic value of the $R^2$ obtained by fitting model $H_0$, but under the distributional assumption of model $H_1$? The asymptotic value (if there is one) seems more difficult to find, though it can be explored numerically, as sketched below.
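
One can at least explore it numerically, in the spirit of the replication experiment above. Below is a sketch (the names x1 and fit0 are mine) in which the true mean $\mu$ lies in $W_1$, spanned by an intercept and two predictors, while the fitted submodel uses only x0; the sample $R^2$ appears to converge to the value obtained by replacing the submodel's fitted values with the projection $P_{W_0}\mu$:

## design: Mu lies in W1 (intercept, x0, x1) but we fit only W0 (intercept, x0)
n0 <- 10
sigma <- 1
x0 <- rnorm(n0, 1:n0, sigma)
x1 <- rnorm(n0)
Mu <- 1 + 2*x0 + 3*x1

## replicate the model k times and compute the submodel's R-squared
k <- 2000
y <- rep(Mu, k) + sigma * rnorm(k*n0)
summary(lm(y ~ rep(x0, k)))$r.squared

## conjectured limit: the part of Mu missed by W0 behaves as extra error variance
fit0 <- lm(Mu ~ x0)
1 - (sum(residuals(fit0)^2)/n0 + sigma^2) /
    (sum((Mu - mean(Mu))^2)/n0 + sigma^2)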
