Solved – Bias and Variance – Errors in R example


I am trying to better understand the bias-variance trade-off, and put together an R example. It attempts to calculate the bias and variance of smoothing splines fit with different smoothing parameters. However, I fear that the bias and MSE calculations are incorrect (based on the graph), and I fail to demonstrate that the MSE is equal to variance + squared bias + irreducible error. Can anyone help me and point out where I am going wrong?

[graphical output]

#define the generating function
x=seq(from=0,to=100,by=1)        #101 x values
e=rnorm(n=length(x),mean=0,sd=2) #irreducible noise
y=0.25*x                         #true relationship (linear)
obs.data=0.25*x+e                #observed data - with error
values.real=cbind(x,y,obs.data)

#we observe 10 datapoints
train.data=values.real[sample(nrow(values.real),size=10,replace=FALSE),] #sample 10 of the 101 rows

#graphing - quick and dirty
plot(x,y,col="white",main="x, y and observed data (points)") #invisible points set up the axes
lines(x,y)
points(x=train.data[,1],y=train.data[,3])

#----section 2 - test plot - fits based on 80 different observations of 10 training points
yhat.mat=NULL #clean matrix

for(i in 1:80){
  train.data=values.real[sample(nrow(values.real),size=10,replace=FALSE),] #sample 10 points
  spline=smooth.spline(x=train.data[,1],y=train.data[,3],spar=0.1) #fit a flexible spline
  y.hat=predict(spline,x=values.real[,1]) #predict over all 101 x values
  yhat.mat=rbind(y.hat$y,yhat.mat) #one row per fit
}

matplot(t(yhat.mat), col=rgb(0, 0, 1, 0.07), type="l",main="actual relationship (black) and different fits (blue)")
lines(x,y)

# section 3 - determining bias and variance across smoothing values ----

# setting up of variables
bias.vec=NULL;var.vec=NULL;mse.vec=NULL #initialize vectors
smooth.val=seq(from=0.1,to=1,by=0.05) #run through diff. smooth levels

#run loop over smoothing values - higher values give more restricted (smoother) fits
for(k in 1:length(smooth.val)){
  yhat.mat=NULL #clean matrix

  #fit 80 splines for each smoothing value, each based on a fresh sample of 10 datapoints
  for(i in 1:80){
    train.data=values.real[sample(nrow(values.real),size=10,replace=FALSE),] #sample 10 points
    spline=smooth.spline(x=train.data[,1],y=train.data[,3],spar=smooth.val[k]) #fit spline
    y.hat=predict(spline,x=values.real[,1])
    yhat.mat=rbind(y.hat$y,yhat.mat)
  }

  #variance calculation
  (variance=mean(apply(yhat.mat,2,FUN=var)))
  #bias - unsure if correct
  mean.est=apply(yhat.mat,2,FUN=mean) #calculate mean estimate for each x value
  bias.est=mean.est-y #subtract true y values
  bias=mean(bias.est) #average over all x values
  bias=bias^2 #square
  #mse - unsure if correct
  mse.mat=sweep(yhat.mat,MARGIN=2,values.real[,2],FUN="-") #subtract true y value
  mse.mat=mse.mat^2 #square
  mse=sum(apply(mse.mat,2,FUN=sum))/(101*80) #grand average over 101 x values and 80 fits

  bias.vec[k]=bias;var.vec[k]=variance;mse.vec[k]=mse
}

bias.vec; var.vec; mse.vec #check vectors

#plot of outcomes
plot(smooth.val,bias.vec,col="white",ylim=c(0,40),
     main="squared bias (orange), variance (blue), irreducible error (pink), mse (red)",
     xlab="smoothing factor (high is more regularized)",ylab="")
lines(smooth.val,bias.vec,col="orange")
lines(smooth.val,var.vec,col="blue")
lines(smooth.val,mse.vec,col="red")
abline(a=4,b=0,col="pink") #irreducible error: sd=2, so variance=4

Best Answer

If I've understood the question correctly, the goal is to evaluate an estimator for $y$. If so, the problem is that you're counting the irreducible error twice. In mathematical notation, letting the estimate be denoted by $Z$, we have $$\begin{split}\textrm{MSE}(Z) &= \mathbb{E}((Z-y)^2) \\ &= \mathbb{E}((Z-\mathbb{E}(Z))^2) + \mathbb{E}((\mathbb{E}(Z)-y)^2) \\ &= \mathbb{E}((Z-\mathbb{E}(Z))^2) + \mathbb{E}((\mathbb{E}(Z)-\mathbb{E}(y))^2) + \mathbb{E}((\mathbb{E}(y)-y)^2) \\ &= \underbrace{\mathrm{Var}(Z)}_{\textrm{variance}} + \underbrace{(\mathbb{E}(Z)-\mathbb{E}(y))^2}_{\textrm{bias}^2} + \underbrace{\mathrm{Var}(y)}_{\textrm{irreducible error}}. \end{split}$$ The important thing to notice here is the definition of the bias: it is the expectation of the estimate minus the expectation of $y$. In the code, however, you're calculating the bias as $\mathbb{E}(Z)-y$. That means you've computed the MSE as an empirical estimate of $$\textrm{MSE}(Z) = \mathrm{Var}(Z) + \mathbb{E}((\mathbb{E}(Z)-y)^2),$$ where the second term is the squared bias plus the irreducible error, as in the second line of the equation above. This means the pink line in your plot is superfluous, because the irreducible error is already reflected in the orange squared-bias line.
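As a minimal sketch of estimating each term separately (reusing yhat.mat and y from your code, and assuming the simulated sd of 2; the variable names below are just illustrative):

variance=mean(apply(yhat.mat,2,FUN=var)) #Var(Z), averaged over the x grid
bias.sq=mean((colMeans(yhat.mat)-y)^2)   #(E(Z)-E(y))^2, averaged over x; here E(y)=0.25*x=y
irr.err=2^2                              #Var(y): the simulation uses sd=2
mse.decomp=variance+bias.sq+irr.err      #three-term decomposition of the MSE against noisy y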

Therefore, you should be able to add the blue and orange lines and have the result roughly match the red MSE line. It won't be exact, because of the bootstrap issues mentioned by hitchpy, but it will be pretty close. An example with substantial overlap is below, where the green line is the sum.

[Example variance-bias plot]
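For reference, a one-line sketch of that check, reusing the vectors from your final plot:

lines(smooth.val,bias.vec+var.vec,col="green") #variance + (squared bias incl. irreducible error) should track the red MSE line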