Machine Learning – Bias and Variance of a Decision Tree for Classification Explained

bias-variance-tradeoff, cart, machine-learning

There is a lot of discussion about bagging and boosting in the context of decision trees, and about how Random Forests and other ensemble methods help tackle bias and variance. But how exactly can I measure the bias and variance of a decision tree? If I am performing k-fold cross-validation, can I calculate bias and variance the way I would for a least-squares regression? I see that many packages allow the user to input a number of folds for a Random Forest algorithm. Can the output of cross-validation (which will be a confusion matrix) be used to create bias and variance metrics of some sort? Can you please explain bias and variance in the context of binary classification with a decision tree, and how cross-validation helps?

Best Answer

There is nothing special about estimating bias and variance in ensemble methods (whether bagging or boosting). It is just like estimating them for any other supervised learner.

To estimate bias, you start by assuming a fixed theoretical limit on accuracy, also known as the Bayes risk. Let's say this limit corresponds to 100% accuracy. Then you calculate the accuracy on the training data. The difference between the training accuracy and the best achievable accuracy is an estimate of the bias. For example, if you get 80% training accuracy against a 100% limit, that 20-point gap indicates a bias problem.

Afterwards, you calculate the accuracy on a test set that you kept aside (i.e., data you didn't train on). The difference between the training accuracy and the test accuracy is an estimate of the variance.
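As a minimal sketch of the two estimates above: the accuracy numbers below are made up for illustration, and the 100% Bayes limit is the assumption from the answer, not something you would usually know in practice.

```python
# Hypothetical accuracies for some trained decision tree (illustrative numbers).
bayes_accuracy = 1.00   # assumed best achievable accuracy (Bayes limit)
train_accuracy = 0.95   # accuracy measured on the training data
test_accuracy = 0.87    # accuracy measured on the held-out test data

# Gap to the theoretical limit -> rough estimate of bias.
bias_estimate = bayes_accuracy - train_accuracy

# Gap between training and test performance -> rough estimate of variance.
variance_estimate = train_accuracy - test_accuracy

print(f"bias ~ {bias_estimate:.2f}, variance ~ {variance_estimate:.2f}")
```

With these numbers the tree looks low-bias (5-point gap to the limit) but comparatively high-variance (8-point train/test gap), which is the typical profile of a deep, unpruned decision tree.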

More stable estimates of the variance can be computed using k-fold cross-validation, since you average the train/test gap over k different splits instead of relying on a single held-out set.
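To make the cross-validation step concrete, here is a self-contained sketch using a toy 1-D dataset and a one-level "decision tree" (a stump fit by brute-force threshold search); both the data and the stump learner are invented for illustration. The spread of the per-fold accuracies gives a feel for the variance of the learner.

```python
from statistics import mean, stdev

def stump_predict(threshold, x):
    # One-level "decision tree": predict class 1 if the feature exceeds the threshold.
    return 1 if x > threshold else 0

def fit_stump(xs, ys):
    # Brute-force search: pick the threshold that maximizes training accuracy.
    best_t, best_acc = xs[0], -1.0
    for t in xs:
        acc = mean(1 if stump_predict(t, x) == y else 0 for x, y in zip(xs, ys))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Toy 1-D binary classification data (made up for illustration).
xs = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7, 0.2, 0.6]
ys = [0,   0,   0,    1,   1,   1,   0,   1]

k = 4
fold_acc = []
for i in range(k):
    test_idx = set(range(i, len(xs), k))  # every k-th point goes to the test fold
    tr = [(x, y) for j, (x, y) in enumerate(zip(xs, ys)) if j not in test_idx]
    te = [(x, y) for j, (x, y) in enumerate(zip(xs, ys)) if j in test_idx]
    t = fit_stump([x for x, _ in tr], [y for _, y in tr])
    fold_acc.append(mean(1 if stump_predict(t, x) == y else 0 for x, y in te))

# Mean accuracy estimates generalization; the spread across folds reflects variance.
print(f"mean accuracy = {mean(fold_acc):.2f}, spread (std) = {stdev(fold_acc):.2f}")
```

With a real library such as scikit-learn, the same idea is a one-liner via its cross-validation helpers, but the mechanics are exactly the loop above: refit on each training fold, score on the held-out fold, then summarize the per-fold scores.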