Solved – lightgbm: Understanding why it is fast


Despite googling to the best of my ability, I am unfortunately unable to find out why LightGBM is fast. The LightGBM documentation explains that the strategy it follows is 'Leaf-wise (Best-first) Tree Growth', as opposed to 'Level-wise Tree Growth', but I am unable to understand the difference. As far as I understand, in a decision tree, before splitting at each node, the information gain that would result from each candidate feature is calculated, and the feature providing the maximum information gain at that node is selected for the split. In the LightGBM paper, Guolin Ke et al. describe Gradient-based One-Side Sampling (GOSS), which I am unfortunately also unable to understand.
I am familiar with the concept of information gain for measuring impurity, but I do not understand what role the gradient plays in it, or what is meant by the gradient of a data point. Could someone explain in layman's terms?

Best Answer

Since a more detailed explanation was requested:

There are three reasons why LightGBM is fast:

  • Histogram based splitting
  • Gradient-based One-Side Sampling (GOSS)
  • Exclusive Feature Bundling (EFB)

Histogram-based splitting has been in the literature since the late 1990s, but it became popular with XGBoost, which was the first publicly available package to implement it. Since finding the exact optimal split is very costly when there is a lot of data (it involves testing every possible split point), a quantile-based (or histogram-based) approximate solution can make the splitting procedure much faster without losing much accuracy. This involves computing weighted quantiles of each feature (i.e., grouping the data into bins) and choosing split points only at the boundaries between these bins. The algorithm for this procedure can be found in the XGBoost paper. XGBoost proposed both local and global histograms, meaning that they are computed either once at the beginning of the algorithm (global) or at every new split (local). LightGBM states briefly that it builds on histogram-based splitting (there are many papers on this), but it does not clarify how the histograms are computed or how this is implemented together with GOSS.
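To make this concrete, here is a minimal sketch (not LightGBM's actual code) of histogram-based split finding for a single feature. The second-order gain formula is the one from the XGBoost paper; all function and parameter names here are illustrative:

```python
import numpy as np

def histogram_split(feature, gradients, hessians, n_bins=32, reg_lambda=1.0):
    """Sketch: find the best split for one feature using a histogram.

    Instead of testing every unique value, bucket the feature into
    quantile bins and evaluate splits only at bin boundaries.
    """
    # Bin edges from quantiles (a "global" histogram in XGBoost's terms).
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, feature)  # bin index per instance

    # Accumulate gradient/hessian sums per bin in one O(n) pass.
    grad_hist = np.bincount(bins, weights=gradients, minlength=n_bins)
    hess_hist = np.bincount(bins, weights=hessians, minlength=n_bins)

    g_total, h_total = grad_hist.sum(), hess_hist.sum()
    best_gain, best_bin = 0.0, None
    g_left = h_left = 0.0
    # Scan bin boundaries only: O(n_bins) candidates instead of O(n).
    for b in range(n_bins - 1):
        g_left += grad_hist[b]
        h_left += hess_hist[b]
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = (g_left**2 / (h_left + reg_lambda)
                + g_right**2 / (h_right + reg_lambda)
                - g_total**2 / (h_total + reg_lambda))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin
```

The point is that the expensive per-value scan is replaced by one pass to fill the histogram plus a scan over a few dozen bins.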

Gradient-based One-Side Sampling (GOSS) is a feature exclusive to LightGBM, and it is essentially an informed subsampling of the data. Since the computational cost of split finding is proportional to the number of features and the number of instances, subsampling the instances makes this step faster; this is also the idea behind Friedman's Stochastic Gradient Boosting. However, SGB samples the data uniformly at random, which often decreases the accuracy of the model. What GOSS does instead is reminiscent of AdaBoost: instances are prioritized by the magnitude of their gradients. The gradient of a data point is the derivative of the loss with respect to the model's current prediction for that point (for squared error, this is simply the residual), so instances with small gradients are already well fit and have little impact on further training. Therefore, high-gradient instances are always kept, low-gradient ones are heavily subsampled, and the weights of the sampled low-gradient instances are recalibrated so as not to bias the distribution of the gradients. This greatly reduces the number of instances while maintaining very good accuracy, and it is one of the reasons why the algorithm performs better than other histogram-based packages such as H2O or XGBoost.
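A rough sketch of the sampling step, following the GOSS procedure described in the paper; `top_rate` and `other_rate` correspond to the paper's a and b, and the function name is my own:

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Sketch of Gradient-based One-Side Sampling.

    Keep the top_rate fraction of instances with the largest absolute
    gradients, randomly sample other_rate of the rest, and up-weight
    the sampled small-gradient instances by (1 - a) / b so the gradient
    statistics stay approximately unbiased.
    """
    rng = rng or np.random.default_rng()
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(-np.abs(gradients))   # largest |gradient| first
    top_idx = order[:n_top]                  # always kept
    rest_idx = order[n_top:]
    sampled_idx = rng.choice(rest_idx, size=n_other, replace=False)

    used_idx = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(len(used_idx))
    # Recalibrate: each sampled small-gradient instance stands in for
    # (1 - a) / b instances of the full dataset.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return used_idx, weights
```

With a = 0.2 and b = 0.1, for example, only 30% of the instances are used in each iteration, yet all of the large-gradient ones are retained.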

Exclusive Feature Bundling (EFB) is used to deal with sparse features. I will not get into the details, mostly because I am not particularly familiar with them; suffice it to say that EFB bundles mutually exclusive sparse features together (features that are rarely or never non-zero at the same time) in a way that greatly reduces the computational effort on large sparse datasets (as mentioned, split finding is also proportional to the number of features). Finding the optimal bundling is an NP-hard problem (it reduces to graph coloring), but it is solved with a good approximation through a greedy algorithm.
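For illustration, here is a simplified sketch of the greedy bundling step. It omits the paper's sorting of features by non-zero count and the bin-offset step that keeps bundled features distinguishable, and the names are my own:

```python
import numpy as np

def greedy_feature_bundles(X, max_conflicts=0):
    """Sketch of greedy Exclusive Feature Bundling.

    Two features "conflict" when both are non-zero on the same row.
    Each feature is placed into the first existing bundle whose total
    conflict count stays within the tolerance; otherwise it starts a
    new bundle.
    """
    n_features = X.shape[1]
    bundles = []      # each bundle: list of feature indices
    bundle_nnz = []   # boolean mask of rows where the bundle is non-zero
    for j in range(n_features):
        col_nnz = X[:, j] != 0
        for k, mask in enumerate(bundle_nnz):
            conflicts = np.count_nonzero(mask & col_nnz)
            if conflicts <= max_conflicts:
                bundles[k].append(j)
                bundle_nnz[k] = mask | col_nnz
                break
        else:
            bundles.append([j])
            bundle_nnz.append(col_nnz)
    return bundles
```

On something like one-hot encoded data, many columns are mutually exclusive by construction, so the effective number of features after bundling can be dramatically smaller.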

In their documentation they also mention leaf-wise (best-first) tree growth. As far as I know this is not discussed in the paper, but it is intended to improve accuracy rather than speed: instead of expanding the whole tree one level at a time, the leaf whose split yields the largest gain is always expanded next.
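Conceptually, leaf-wise growth keeps a priority queue of leaves ordered by split gain. The sketch below assumes hypothetical `best_split_gain()` and `split()` methods on a leaf object, just to show the control flow:

```python
import heapq

def grow_leaf_wise(root, num_leaves):
    """Sketch of leaf-wise (best-first) growth: always split the leaf
    with the largest gain next, instead of a whole level at a time.
    `best_split_gain()` and `split()` are hypothetical helpers.
    """
    # Max-heap keyed on gain (negated, since heapq is a min-heap);
    # the counter breaks ties without comparing leaf objects.
    heap = [(-root.best_split_gain(), 0, root)]
    counter = 1
    leaves = 1
    while heap and leaves < num_leaves:
        neg_gain, _, leaf = heapq.heappop(heap)
        if -neg_gain <= 0:
            break                      # no remaining leaf improves the fit
        left, right = leaf.split()     # consume one leaf, create two
        leaves += 1
        for child in (left, right):
            heapq.heappush(heap, (-child.best_split_gain(), counter, child))
            counter += 1
    return root
```

This is why LightGBM's main complexity knob is `num_leaves` rather than a tree depth: a leaf-wise tree of a given leaf count can be much deeper and more accurate than a level-wise tree of the same size, at some risk of overfitting.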

Source: LightGBM paper :)
