I will accept the answer on 1) from Kunlun, but just to close this case, here are the conclusions on the two questions that I reached in my thesis (both of which were accepted by my supervisor):
1) More data produces better models, and since we only use part of the whole training data to train each model (bootstrap), higher bias occurs in each tree (copied from the answer by Kunlun).
2) In the Random Forests algorithm, we limit the number of variables to split on in each split - i.e. we limit the number of variables to explain our data with. Again, higher bias occurs in each tree.
Conclusion: Both situations are a matter of limiting our ability to explain the population: First we limit the number of observations, then we limit the number of variables to split on in each split. Both limitations lead to higher bias in each tree, but often the variance reduction in the ensemble outweighs the bias increase in each tree, and thus Bagging and Random Forests tend to produce a better model than just a single decision tree.
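As a minimal sketch of how those two limitations show up in practice (assuming scikit-learn and a synthetic dataset; the dataset and parameter values are my own illustrative choices, not from the thesis), you can compare a single fully grown tree against a random forest whose `bootstrap` and `max_features` parameters encode exactly those two restrictions:

```python
# Sketch: single decision tree vs. random forest on a synthetic dataset.
# Assumes scikit-learn; dataset and parameter values are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single, fully grown tree: low bias, high variance.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Random forest: each tree sees a bootstrap sample (limited observations)
# and only sqrt(n_features) candidate variables per split (limited variables).
forest = RandomForestClassifier(n_estimators=200, bootstrap=True,
                                max_features="sqrt",
                                random_state=0).fit(X_train, y_train)

print("single tree test accuracy:", tree.score(X_test, y_test))
print("random forest test accuracy:", forest.score(X_test, y_test))
```

The forest's per-tree bias is a bit higher, but averaging many decorrelated trees usually wins on test accuracy.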
Overfitting/High Variance:
Your model fits the training set very well, but fits the cross-validation set poorly. If you have no cross-validation set, then it fits the test set poorly.
Underfitting/High bias:
Your model fits the training set badly and also fits the test/CV set badly.
=> In both cases the model fits the test set badly, yet we want our model to fit the test set well. Testing accuracy is more important than training accuracy, because you want to know how good your model is on data it has not yet seen.
Your interpretation is correct: If you have low bias and low variance, then the model has good training accuracy and good testing accuracy.
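As a rough illustration (a sketch assuming scikit-learn; the model and dataset are placeholders I chose, not anything from the question), you can compare training accuracy with cross-validated accuracy to tell the two cases apart:

```python
# Sketch: diagnose over-/underfitting by comparing training and CV accuracy.
# Assumes scikit-learn; model and data are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

train_acc = model.score(X, y)
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

# High train_acc but much lower cv_acc -> high variance (overfitting).
# Both low -> high bias (underfitting).
print(f"train accuracy: {train_acc:.3f}, CV accuracy: {cv_acc:.3f}")
```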
You can also deduce this from the confusion matrix:
Few misclassifications in the training set and many in the test set:
You have high variance. You are overfitting your data.
Many misclassifications in the training set and many in the test set:
You might have high bias. You are probably underfitting your data.
Many misclassifications in the training set and few in the test set:
This should usually not happen. It typically indicates a mistake, often non-random sampling of your data. Shuffle your data and fit the model again.
Few misclassifications in the training set and few in the test set:
You have low bias and low variance. You reached your goal! Congratulations.
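Here is a small sketch of that confusion-matrix check (again assuming scikit-learn; the classifier and data are placeholders), computed on both the training and the test set:

```python
# Sketch: confusion matrices on training and test data to spot over-/underfitting.
# Assumes scikit-learn; classifier and data are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True,
                                                    random_state=1)

clf = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Few off-diagonal entries on train but many on test -> high variance.
# Many off-diagonal entries on both -> likely high bias.
print("train:\n", confusion_matrix(y_train, clf.predict(X_train)))
print("test:\n", confusion_matrix(y_test, clf.predict(X_test)))
```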
A bit late to the party, but I feel that this question could use an answer with concrete examples.
I will write a summary of this excellent article: bias-variance-trade-off, which helped me understand the topic.
The prediction error for any machine learning algorithm can be broken down into three parts:
Irreducible error
As the name implies, this is an error component that we cannot correct, regardless of the algorithm and its parameter selection. Irreducible error is due to complexities that are simply not captured in the training set. These could be attributes that we don't have in the learning set but that affect the mapping to the outcome regardless.
Bias error
Bias error is due to our assumptions about the target function. The more assumptions (restrictions) we make about the target function, the more bias we introduce. Models with high bias are less flexible because we have imposed more rules on the target function.
Variance error
Variance error is the variability of the fitted target function's form with respect to different training sets. Models with small variance error will not change much if you replace a couple of samples in the training set. Models with high variance might be affected even by small changes in the training set.
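To make these three components concrete, here is a small simulation sketch (my own illustration, not from the linked article; the target function, noise level, and model are assumptions): repeatedly draw training sets from a known function plus noise, fit a model each time, and estimate the bias², variance, and irreducible-noise parts of the expected squared error at one test point.

```python
# Sketch: empirical bias/variance/irreducible-error decomposition at one test point.
# The target function f, noise level, and model are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)           # true target function
noise_sd = 0.3                        # irreducible noise
x0 = np.array([[0.8]])                # test point

preds = []
for _ in range(500):                  # many independent training sets
    X = rng.uniform(-1, 1, size=(50, 1))
    y = f(X).ravel() + rng.normal(0, noise_sd, size=50)
    model = LinearRegression().fit(X, y)   # a deliberately restrictive model
    preds.append(model.predict(x0)[0])

preds = np.array(preds)
bias_sq = (preds.mean() - f(x0).item()) ** 2
variance = preds.var()
print(f"bias^2={bias_sq:.4f}  variance={variance:.4f}  "
      f"irreducible={noise_sd**2:.4f}")
# Expected squared error at x0 is approximately bias^2 + variance + noise_sd^2.
```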
Consider simple linear regression: Y = b0 + b1*x.
Obviously, this is a fairly restrictive form for the target function, and therefore this model has high bias.
It also has low variance: if you change a couple of data samples, it's unlikely that this will cause major changes in the overall mapping the fitted function performs. On the other hand, algorithms such as k-nearest-neighbors (k-NN) have high variance and low bias. It's easy to imagine how different samples might affect the k-NN decision surface.
Generally, parametric algorithms have high bias and low variance, while non-parametric algorithms have low bias and high variance.
One of the challenges of machine learning is finding the right balance of bias error and variance error.
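A quick way to see the contrast (a sketch under illustrative assumptions; the data-generating process and the choice of k are mine) is to refit both models on many fresh training sets and measure how much their predictions at fixed test points move around:

```python
# Sketch: compare prediction variability (variance) of linear regression vs. k-NN.
# Data-generating process and k are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x)
x_test = np.linspace(-1, 1, 25).reshape(-1, 1)

lin_preds, knn_preds = [], []
for _ in range(300):                              # fresh training set each time
    X = rng.uniform(-1, 1, size=(60, 1))
    y = f(X).ravel() + rng.normal(0, 0.3, size=60)
    lin_preds.append(LinearRegression().fit(X, y).predict(x_test))
    knn_preds.append(KNeighborsRegressor(n_neighbors=3).fit(X, y).predict(x_test))

# Average variance of the predictions across training sets:
print("linear regression variance:", np.var(lin_preds, axis=0).mean())
print("k-NN (k=3) variance:       ", np.var(knn_preds, axis=0).mean())
```

The linear model's predictions barely move between training sets, while the k-NN predictions fluctuate much more.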
Decision tree
Now that we have these definitions in place, it's also straightforward to see that decision trees are an example of a model with low bias and high variance. A tree makes almost no assumptions about the target function, but it is highly susceptible to variance in the data.
There are ensemble algorithms, such as bootstrap aggregation (bagging) and random forests, which aim to reduce variance at the small cost of some bias in each decision tree.
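As a final check of that claim (a sketch under the same illustrative setup as above, assuming scikit-learn's BaggingRegressor), you can measure how much a single tree's predictions move across training sets versus a bagged ensemble's:

```python
# Sketch: prediction variance of a single decision tree vs. a bagged ensemble.
# Setup is illustrative; assumes scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(2)
f = lambda x: np.sin(3 * x)
x_test = np.linspace(-1, 1, 25).reshape(-1, 1)

tree_preds, bag_preds = [], []
for _ in range(200):                               # fresh training set each time
    X = rng.uniform(-1, 1, size=(80, 1))
    y = f(X).ravel() + rng.normal(0, 0.3, size=80)
    tree_preds.append(DecisionTreeRegressor().fit(X, y).predict(x_test))
    bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50)
    bag_preds.append(bag.fit(X, y).predict(x_test))

print("single tree variance:", np.var(tree_preds, axis=0).mean())
print("bagged trees variance:", np.var(bag_preds, axis=0).mean())
```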