You have made some mistakes in the R code, I suspect because you over-complicated the problem.
Consider the following data.table R code as a starting point:
library(data.table)

res <- data.table(lhs1 = rep(0, 10), lhs2 = rep(0, 10), rhs = rep(0, 10))
dt  <- data.table(y = rep(0, 100), x1 = rep(0, 100), x2 = rep(0, 100),
                  x3 = rep(0, 100), fx = rep(0, 100), e = rep(0, 100))

for (i in seq_len(nrow(res))) {
  # Draw fresh predictors and noise for this replication
  dt[, ':='(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100), e = rnorm(100))]
  dt[, c("y", "fx") := {
    fx <- x1 + x2 + x3
    y  <- fx + e
    .(y = y, fx = fx)
  }]
  res[i, ':='(lhs1 = mean(dt$y^2),
              lhs2 = mean(dt$y)^2 + var(dt$y) * (nrow(dt) - 1) / nrow(dt),
              rhs  = var(dt$fx + dt$e) * (nrow(dt) - 1) / nrow(dt) +  # Variance
                     mean(dt$fx + dt$e)^2)]                           # Bias^2
}
> head(res)
lhs1 lhs2 rhs
1: 4.217220 4.217220 4.217220
2: 5.020779 5.020779 5.020779
3: 3.537500 3.537500 3.537500
4: 4.064400 4.064400 4.064400
5: 3.591889 3.591889 3.591889
6: 4.765356 4.765356 4.765356
The key to understanding what the bias-variance decomposition is lies in the two lines:
res[i, ':='(lhs1 = mean(dt$y^2),
lhs2 = mean(dt$y)^2 + var(dt$y)*(nrow(dt)-1)/nrow(dt),
It is nothing more than:
$$\text{E}(y^2) = \text{E}(y)^2 + \text{Var}(y)$$
The correspondence between the math and the code should be clear, with the additional note that var(dt$y)
in the code divides by n-1
by default but, since this is a population relationship, it needs to be rescaled to divide by n
instead.
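As a quick numerical check, here is a minimal base-R sketch (independent of the data.table code above; the seed and sample values are arbitrary) showing that the identity holds exactly once var() is rescaled to the population variance:

```r
set.seed(1)
y <- rnorm(100, mean = 2)
n <- length(y)

lhs <- mean(y^2)
# var() divides by n - 1, so rescale by (n - 1)/n to get the population variance
rhs <- mean(y)^2 + var(y) * (n - 1) / n

all.equal(lhs, rhs)  # TRUE: the identity is exact, not just approximate
```

Without the (n - 1)/n correction the two sides differ by Var(y)/n, which is exactly the discrepancy you were seeing.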
The line rhs = ...
is there simply to show how this calculation should be performed if you have an expression of the form $y_i = f(x_i) + e_i$ instead of the simpler $y_i = \mu + e_i$ (where, above, $\mu = 0$).
It is said that bagging reduces variance and boosting reduces bias.
Indeed, and in each case this is relative to the base learners the two ensembling methods employ.
For bagging and random forests, deep/large trees are generally employed as base learners. Large trees have high variance, but low bias. Ensembling many large trees reduces the variance.
Boosting is most effective with 'weak learners': base learners that perform only slightly better than chance. Small trees generally work best; stumps (i.e., single-split trees) are often used with boosting. Small trees have low variance but high bias. Averaging over many trees (combined with updating the response variable after fitting each tree, which puts more weight on training observations not yet well predicted) thus reduces the bias.
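The bias-reduction effect can be illustrated with a minimal boosting-with-stumps sketch in base R (the function names, the shrinkage value nu, and the sin target are my illustrative choices, not part of the original answer). With a noiseless target, the training MSE is essentially the squared bias of the ensemble so far, and it shrinks as stumps are added:

```r
set.seed(7)
x <- runif(200)
y <- sin(2 * pi * x)  # noiseless target, so training MSE tracks squared bias

# Fit the best single-split regression tree (a stump) to residuals r
fit_stump <- function(x, r) {
  splits <- sort(unique(x))
  best <- list(sse = Inf)
  for (s in splits[-length(splits)]) {
    left <- x <= s
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse  <- sum((r - pred)^2)
    if (sse < best$sse)
      best <- list(sse = sse, s = s, ml = mean(r[left]), mr = mean(r[!left]))
  }
  best
}
pred_stump <- function(st, x) ifelse(x <= st$s, st$ml, st$mr)

# Boosting loop: each stump is fit to the current residuals
boost <- function(x, y, M, nu = 0.5) {
  f   <- rep(0, length(y))
  mse <- numeric(M)
  for (m in seq_len(M)) {
    st     <- fit_stump(x, y - f)        # refit on the "updated response"
    f      <- f + nu * pred_stump(st, x)
    mse[m] <- mean((y - f)^2)
  }
  mse
}

mse <- boost(x, y, 50)
mse[1] > mse[50]  # TRUE: error keeps falling as stumps are added
```

A single stump can only represent a step function, so its bias here is large; the sum of many stumps can approximate the smooth sine curve, which is the bias reduction the text describes.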
Best Answer
Quite surprising that the experts couldn't help you out; the chapter on random forests in "The Elements of Statistical Learning" explains this very well.
Basically, given $n$ i.i.d. random variables, each with variance $\sigma^2$, the variance of the mean of these variables is $\sigma^2/n$.
Since the random forest is built on bootstrap samples of the data, the outputs of the individual trees can be viewed as identically distributed (but not independent) random variables.
Thus, by averaging the outputs of $B$ trees, the variance of the final prediction is given by
$$\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2,$$
where $\rho$ is the pairwise correlation between trees. For large $B$ the second term vanishes and the variance is reduced to $\rho\sigma^2$.
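A small simulation confirms this formula (a sketch: the values of rho, B, and the shared-component construction below are my choices for illustration). Each "tree output" is built from a shared component plus an independent component, which gives the desired pairwise correlation:

```r
set.seed(42)
rho <- 0.5; sigma2 <- 1; B <- 50; nsim <- 50000

# sqrt(rho)*Z + sqrt(1-rho)*eps has variance 1 and pairwise correlation rho
Z   <- rnorm(nsim)                       # shared component, reused across trees
eps <- matrix(rnorm(nsim * B), nsim, B)  # independent per-tree components
trees <- sqrt(rho) * Z + sqrt(1 - rho) * eps

avg <- rowMeans(trees)                   # the "ensemble" prediction

var(avg)                               # approximately 0.51
rho * sigma2 + (1 - rho) * sigma2 / B  # theory: 0.5 + 0.5/50 = 0.51
```

Setting rho to 0 recovers the familiar i.i.d. result sigma^2/B, while rho = 1 (identical trees) gives no variance reduction at all, which is why random forests decorrelate trees via feature subsampling.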
This works not only for decision trees but for any model that can be bagged. The reason it works particularly well for decision trees is that they inherently have low bias (no assumptions are made, such as a linear relation between features and response) but very high variance.
Since only the variance can be reduced, decision trees are grown to node purity in the context of random forests and tree bagging. (Growing to node purity maximizes the variance of the individual trees, i.e., they fit the training data perfectly, while minimizing the bias.)