Solved – XGBoost paper – time complexity analysis

boosting, time complexity

I'm reading through the XGBoost paper and I'm confused by the subsection of Section 4.1 titled "Time Complexity Analysis". There the authors assert that the exact greedy algorithm with $K$ trees, a maximum depth of $d$ per tree, and $\|\mathbf{x}\|_0$ "non-missing entries in the training data", using the original sparsity-aware algorithm, incurs a time complexity of
$$O(Kd\|\mathbf{x}\|_0\log n).$$
The first three factors make sense to me (the $Kd\|\mathbf{x}\|_0$ part) – I'm interpreting it as:

  • There are $K$ trees, giving you a factor of $K$
  • At each of the $d$ layers of each tree, you need to scan through $\|\mathbf{x}\|_0$ block entries to find the optimal split. (My understanding is that $\|\mathbf{x}\|_0$ is the total number of nonzero feature values, aggregated across all feature columns and all training examples. This would mean that each block consists of $3\|\mathbf{x}\|_0$ numbers in CSC format.) This gives a factor of $d\|\mathbf{x}\|_0$.
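To make the per-layer scan concrete, here is a minimal Python sketch of exact greedy split finding on a single pre-sorted feature column, using the split-gain formula from the paper (Eq. 7). The function and variable names are my own, and sparsity/missing-value handling is omitted; this is just to show why one column costs a single linear pass once it's sorted.

```python
import numpy as np

def best_split_one_feature(sorted_vals, sorted_g, sorted_h, lam=1.0, gamma=0.0):
    """One left-to-right scan over a feature column already sorted ascending.

    sorted_g / sorted_h are the first- and second-order gradient statistics
    of each example, in the same sorted order. Uses the gain formula from
    Eq. (7) of the XGBoost paper; names here are illustrative, not the
    library's API.
    """
    G, H = sorted_g.sum(), sorted_h.sum()   # totals for the node
    GL = HL = 0.0                           # running left-side sums
    best_gain, best_thresh = 0.0, None
    for i in range(len(sorted_vals) - 1):
        GL += sorted_g[i]
        HL += sorted_h[i]
        if sorted_vals[i] == sorted_vals[i + 1]:
            continue  # can't split between equal feature values
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_thresh = 0.5 * (sorted_vals[i] + sorted_vals[i + 1])
    return best_gain, best_thresh
```

Running this over every column touches each stored entry once, which is where the per-layer $\|\mathbf{x}\|_0$ factor comes from once sorting is out of the picture.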

However, I'm not sure what $n$ is supposed to signify (it's not specified in this section of the paper) or where the $\log n$ factor comes from. Based on earlier usage in the paper, I'd assume that $n$ is the number of training examples, but it's not clear to me how that results in a multiplicative $\log n$ increase in time complexity of the exact greedy algorithm.

Does anyone understand how the authors got to this time complexity? Any help would be much appreciated.

Best Answer

Answered my own question – it turns out I misread the paper.

The original sparsity-aware algorithm doesn't use block storage, so to find the optimal split at each node, you need to re-sort the data on each column. This incurs a per-layer time complexity that is very crudely bounded by $O(\|\mathbf{x}\|_0 \log n)$: say you have $\|\mathbf{x}\|_{0i}$ nonzero entries for each feature $1 \le i \le m$; then at each layer you're sorting $m$ lists, each of length at most $n$, whose lengths sum to $\sum_{i=1}^m \|\mathbf{x}\|_{0i} = \|\mathbf{x}\|_0$, which can't take more than $O(\|\mathbf{x}\|_0 \log n)$ time. Multiplying by $K$ trees and $d$ layers per tree gives the original $O(Kd\|\mathbf{x}\|_0 \log n)$ time complexity.
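Spelling the sorting bound out as a chain of inequalities, using $\|\mathbf{x}\|_{0i} \le n$ for every feature:

$$\sum_{i=1}^m O\big(\|\mathbf{x}\|_{0i}\log \|\mathbf{x}\|_{0i}\big) \;\le\; \sum_{i=1}^m O\big(\|\mathbf{x}\|_{0i}\log n\big) \;=\; O\Big(\log n \sum_{i=1}^m \|\mathbf{x}\|_{0i}\Big) \;=\; O\big(\|\mathbf{x}\|_0 \log n\big).$$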

On the other hand, with the block structure, since the data is already pre-sorted by every column at the start, you don't need to re-sort at each node, as long as you keep track of which data points have made it to each node. As the authors note, this brings the complexity down to $O(Kd\|\mathbf{x}\|_0)$ (since the optimal splits at each layer can now be found via a single scan over the block), plus whatever the cost of preprocessing is (the authors claim it to be $O(\|\mathbf{x}\|_0 \log n)$, which makes sense, since that's one sort per column).
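A toy illustration of the block idea, assuming a dense matrix for simplicity (the paper's actual blocks are CSC with missing entries skipped, and these helper names are mine, not the library's): sort every column once up front, then at each node filter the pre-sorted order through a membership mask instead of re-sorting.

```python
import numpy as np

def build_block(X):
    """One-time preprocessing: the O(||x||_0 log n) sort, done once.

    Returns an index matrix where column j lists the row indices of X
    sorted by feature j.
    """
    return np.argsort(X, axis=0, kind="stable")

def scan_node(order_col, in_node):
    """Visit, in sorted feature order, only the rows that reached a node.

    in_node is a boolean mask over rows. No re-sorting happens here, so
    split finding at every node of a layer is a single pass over the block.
    """
    for row in order_col:
        if in_node[row]:
            yield row
```

The design point is that `build_block` is paid once per dataset, while `scan_node` is linear in the block size, matching the $O(Kd\|\mathbf{x}\|_0 + \|\mathbf{x}\|_0 \log n)$ total the authors report.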