Solved – How to verify the bias-variance decomposition using a simulated experiment

bias-variance tradeoff, machine learning, r, regression, simulation

The bias-variance decomposition can be expressed as:

\begin{align} \newcommand{\Var}{{\rm Var}}
E[(y_0 - \hat{f}(x_0))^2] &= (E[\hat{f}(x_0)] - f(x_0))^2 + E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2] + \sigma^2\\
&= [{\rm Bias}(\hat{f}(x_0))]^2 + \Var(\hat{f}(x_0)) + \Var(\varepsilon)
\end{align}
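
As I understand it, the identity follows from expanding the square and using the facts that $\varepsilon$ has mean zero, variance $\sigma^2$, and is independent of $\hat{f}(x_0)$ (a sketch of the derivation, in case my reading of it is where the problem lies):

\begin{align}
E[(y_0 - \hat{f}(x_0))^2] &= E[(f(x_0) + \varepsilon - \hat{f}(x_0))^2]\\
&= E[(f(x_0) - \hat{f}(x_0))^2] + 2\,E[\varepsilon]\,E[f(x_0) - \hat{f}(x_0)] + E[\varepsilon^2]\\
&= E[(\hat{f}(x_0) - E[\hat{f}(x_0)] + E[\hat{f}(x_0)] - f(x_0))^2] + \sigma^2\\
&= E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2] + (E[\hat{f}(x_0)] - f(x_0))^2 + \sigma^2
\end{align}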

I tried to verify the expression with a simulated experiment, but the result suggests that the left-hand side is not equal to the right-hand side. Here's the R code to reproduce the experiment:

library(tidyverse)

set.seed(1)

coefs <- rnorm(4) # generates four parameters of the underlying linear model (see below)

training_sets <- lapply(1:100, function(...) { # 100 training sets, each with 100 observations
  tibble(
    X1 = rnorm(100),
    X2 = rnorm(100),
    X3 = rnorm(100),
    fX = coefs[1] * X1 + coefs[2] * X2 + coefs[3] * X3 + coefs[4], # a simple linear regression model with three predictors and an intercept
    Y = fX + rnorm(100) # adds irreducible error
  )
})

fX_estimates <- training_sets %>% # train a linear model on each training set
  lapply(function(training_set_i) {
    lm(Y ~ X1 + X2 + X3 + 1, # the last term (i.e., '1') represents the intercept
       data = training_set_i)
  })

test_set <- tibble( # generate a test set with one observation, from the same population as the training sets
  X1 = rnorm(1),
  X2 = rnorm(1),
  X3 = rnorm(1),
  fX = coefs[1] * X1 + coefs[2] * X2 + coefs[3] * X3 + coefs[4],
  Y = fX + rnorm(1) # adds irreducible error
)

y0 <- test_set$Y

fx0 <- test_set$fX

fx0_estimates <- fX_estimates %>% # estimates of the outcome based on models trained with different training sets
  sapply(function(fX_estimate_i) {
    predict(fX_estimate_i, newdata = test_set)
  })

lhs <- mean((y0 - fx0_estimates) ^ 2) # value of the left hand side of the expression

rhs <- (mean(fx0_estimates) - fx0) ^ 2 + mean((fx0_estimates - mean(fx0_estimates)) ^ 2) + 1 ^ 2 # value of the right hand side of the expression

Here are some guesses that might explain the result:

  1. Some misunderstanding of the bias-variance decomposition, and hence the wrong math expressions and the wrong experiment.
  2. Some kind of bias was introduced by the way I generated the data and the underlying f(X).
  3. The inequality was introduced by random noise. However, increasing the sample size does not help. In addition, when I re-ran the code with 100 different random seeds (a sketch of that check is shown below), $E[(y_0 - \hat{f}(x_0))^2]$ had a mean of 0.66 and a standard deviation of 0.78, while $(E[\hat{f}(x_0)] - f(x_0))^2 + E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2] + \sigma^2$ had a mean of 1.04 and a standard deviation of 0.03. The two sides are still not equal, and the left-hand side is much more variable than the right-hand side.
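
For reference, here is roughly how I ran that repeated-seed check (a sketch only; run_experiment and seed_results are just names I'm using here, and the body simply repeats the simulation above for a given seed):

library(tidyverse)

run_experiment <- function(seed) { # repeats the experiment above for one seed
  set.seed(seed)
  coefs <- rnorm(4)
  training_sets <- lapply(1:100, function(...) {
    tibble(
      X1 = rnorm(100), X2 = rnorm(100), X3 = rnorm(100),
      fX = coefs[1] * X1 + coefs[2] * X2 + coefs[3] * X3 + coefs[4],
      Y = fX + rnorm(100)
    )
  })
  fits <- lapply(training_sets, function(d) lm(Y ~ X1 + X2 + X3, data = d))
  test_set <- tibble(
    X1 = rnorm(1), X2 = rnorm(1), X3 = rnorm(1),
    fX = coefs[1] * X1 + coefs[2] * X2 + coefs[3] * X3 + coefs[4],
    Y = fX + rnorm(1)
  )
  preds <- sapply(fits, function(fit) predict(fit, newdata = test_set))
  c(lhs = mean((test_set$Y - preds) ^ 2),
    rhs = (mean(preds) - test_set$fX) ^ 2 + mean((preds - mean(preds)) ^ 2) + 1)
}

seed_results <- sapply(1:100, run_experiment) # 2 x 100 matrix: one column per seed
rowMeans(seed_results)                        # means of lhs and rhs across seeds
apply(seed_results, 1, sd)                    # their standard deviations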

So, what seems to be the problem?

Best Answer

You have made some mistakes in the R code, I suspect due to over-complicating the problem.

Consider the following (data.table) R code as a starting point:

library(data.table)

res <- data.table(lhs1=rep(0,10), lhs2=rep(0,10), rhs=rep(0,10))
dt <- data.table(y=rep(0,100), x1=rep(0,100), x2=rep(0,100), x3=rep(0,100),
                 fx=rep(0,100), e=rep(0,100))
for (i in seq_len(nrow(res))) {
   # draw fresh predictors and noise, then build y = f(x) + e with f(x) = x1 + x2 + x3
   dt[, ':='(x1=rnorm(100), x2=rnorm(100), x3=rnorm(100), e=rnorm(100))]
   dt[, c("y","fx") := {fx = x1 + x2 + x3
                        y = fx + e
                        .(y=y, fx=fx)}]
   # three equivalent ways of computing E(y^2)
   res[i, ':='(lhs1 = mean(dt$y^2),
               lhs2 = mean(dt$y)^2 + var(dt$y)*(nrow(dt)-1)/nrow(dt),
               rhs = var(dt$fx + dt$e)*(nrow(dt)-1)/nrow(dt) +  # Variance
                     mean(dt$fx + dt$e)^2)]                     # Bias^2
}
> head(res)
       lhs1     lhs2      rhs
1: 4.217220 4.217220 4.217220
2: 5.020779 5.020779 5.020779
3: 3.537500 3.537500 3.537500
4: 4.064400 4.064400 4.064400
5: 3.591889 3.591889 3.591889
6: 4.765356 4.765356 4.765356

The key to understanding what the bias-variance decomposition is doing lies in these two lines:

   res[i, ':='(lhs1 = mean(dt$y^2),
               lhs2 = mean(dt$y)^2 + var(dt$y)*(nrow(dt)-1)/nrow(dt),

It is nothing more than:

$$\text{E}(y^2) = \text{E}(y)^2 + \text{E}\left[(y-\text{E}(y))^2\right]$$

The correspondence between the math and the code should be clear, with one additional note: var(dt$y) divides by n-1 by default, but since this is a population relationship it needs to be rescaled by (n-1)/n so that it is effectively divided by n instead.
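
As a quick standalone check of that identity (a minimal sketch, separate from the loop above, using a large sample so the sample and population quantities nearly coincide):

set.seed(42)
y <- rnorm(1e6, mean = 3, sd = 2)  # any distribution works; here E(y) = 3, Var(y) = 4
n <- length(y)
mean(y^2)                          # E(y^2), approximately 3^2 + 4 = 13
mean(y)^2 + var(y) * (n - 1) / n   # E(y)^2 + population variance, same value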

The line rhs = ... simply shows how the same calculation is performed when you have an expression of the form $y_i = f(x_i) + e_i$ instead of the simpler $y_i = \mu + e_i$ (where, above, $\mu = 0$).
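
In that notation, the same identity applied to $y = f(x) + e$ reads

$$\text{E}\left[(f(x)+e)^2\right] = \text{E}\left[f(x)+e\right]^2 + \text{Var}\left(f(x)+e\right),$$

which is exactly what the mean(dt$fx + dt$e)^2 and var(dt$fx + dt$e)*(nrow(dt)-1)/nrow(dt) terms compute.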
