I have a question regarding the algorithm of Gradient Boosting Tree. I understand Simple tree is built for only a randomly selected sub sample of the full data set (random without replacement). Each consecutive tree is built for the prediction residuals (from all previous trees) of an independently drawn random sample.If an observation is incorrectly classified, then the observation will receive more weight in the next iteration so that it gets properly classified.
My question – if trees are built on random without replacement sample (different observations in each tree). How thousands of trees are built if each tree is built on different observations altogether (random without replacement)? For example, i have a data set consisting of 20,000 records. If i take 0.5 as the fraction of the training set observations randomly selected to propose the next tree in the expansion, how would i be able to create more than 2 trees? My understanding – 50% of the data set is used in building the first tree and the remaining 50% in building the next tree as samples are drawn without replacement. I apologize if it sounds lame. I am very confused in the tree building process.
Best Answer
To clarify, here are my two cents:
Adaboost is actually a particular case of Gradient Boosting where the loss function to minimize is the the exponential loss. The reweighing scheme in Adaboost is due to the exponential loss function, but more generally, each new weak classifier in the sequence is fitted directly on the gradient of the loss function. For instance, for least-squares Boosting (where the loss function is squared error), each new tree is built on the residuals of the current model (the sequence, or additive model, of all the trees built so far). I start at Equation 10.6 page 342 of the Elements of Statistical Learning:
The loss function is: $L(y,f(x))=(y-f(x))^2$
At iteration $m$, we want to come up with a new tree to add to the current model $f_{m-1}(x)$ (sum of all previous trees) in order to get the new model $f_{m}(x)$. We select the tree that minimizes $L$, or equivalently, the tree that best fits the residuals of the current model $y- f_{m-1}(x)$.