Solved – Updating decision tree for new data

boosting, cart, machine learning

Let's say you have trained a decision tree on 40 GB of data.
On Monday morning you receive 10 GB of new data and need to produce some results quickly
to report to your boss.

Can you update the decision tree using the new data?

Or does all the data have to be used to retrain the tree,
which would take too long and therefore make the task infeasible?
If it is doable, how?
If it is not doable, why?

Best Answer

Given a single decision tree that has been trained on an initial dataset, yes, we can theoretically update it using new data instead of retraining it. Practically, though, no: in the use case described it is somewhat of an academic exercise.

Theoretically it is feasible. There is research on this kind of functionality, particularly on the use of decision trees for data streams. An example method is UCVFDT (Uncertainty-handling and Concept-adapting Very Fast Decision Trees) by Liang et al., in which new data are weighted more heavily than previously seen data and the tree is rebalanced sequentially. Krawczyk et al. (2017), "Ensemble learning for data stream analysis: A survey", presents a nice overview of data stream analysis. Boosting approaches lend themselves a bit more naturally to stream data, as the concept of "iteration" is more explicit, but single-tree models are also feasible.

Having mentioned the above, if we only have a single batch coming in once a week, we might as well retrain the decision tree model we use on all the data. The resulting model will be more accurate, better understood, easier to maintain, and readily available in almost any programming language.
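To make the streaming idea a little more concrete, here is a minimal sketch of the split test used by the basic Very Fast Decision Tree family, which UCVFDT builds on. It is not UCVFDT itself: the uncertainty handling and concept adaptation are left out. A leaf accumulates statistics as instances arrive and only splits once the gap between the two best attributes exceeds the Hoeffding bound. The function names and the example numbers are assumptions chosen for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with range `value_range` lies within this distance of the
    mean observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 n_classes: int, delta: float, n: int) -> bool:
    """Hypothetical helper: split a leaf only when the best attribute
    beats the runner-up by more than the Hoeffding bound."""
    # Information gain lies in [0, log2(n_classes)], so that is its range.
    value_range = math.log2(n_classes)
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Same observed gains, different numbers of instances seen at the leaf.
few = should_split(0.30, 0.20, n_classes=2, delta=1e-7, n=200)
many = should_split(0.30, 0.20, n_classes=2, delta=1e-7, n=5000)
print(few, many)
```

The bound shrinks as more instances reach the leaf, so the same observed difference in gain eventually becomes enough to justify a split. That is what lets the tree grow from a stream without revisiting old data.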