Are there any libraries available for CART-like methods using sparse predictors & responses?

Tags: cart, classification, machine learning, r, regression

I'm working with some large data sets using the gbm package in R. Both my predictor matrix and my response vector are pretty sparse (i.e. most entries are zero). I was hoping to build decision trees using an algorithm that takes advantage of this sparseness, as was done here. In that paper, as in my situation, most items have only a few of the many possible features, so the authors avoided a lot of wasted computation by assuming that an item lacked a given feature unless the data explicitly said otherwise. My hope is that I could get a similar speedup by using this sort of algorithm (and then wrapping a boosting algorithm around it to improve my predictive accuracy).

Since they didn't seem to publish their code, I was wondering if there were any open-source packages or libraries (in any language) that are optimized for this case. Ideally, I'd like something that could take a sparse matrix directly from R's Matrix package, but I'll take what I can get.
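
To make the setup concrete, here's a minimal sketch of the kind of data I mean, using the Matrix package (the sizes and density are made up for illustration):

```r
# Sizes and density invented for illustration.
library(Matrix)

set.seed(1)
n <- 10000                                  # items
p <- 1000                                   # possible features
X <- rsparsematrix(n, p, density = 0.001)   # dgCMatrix, ~0.1% non-zero
y <- rbinom(n, 1, 0.05)                     # mostly-zero response

# What I do today: densify before handing the data to gbm, since as far
# as I know it doesn't accept Matrix's sparse classes directly. This is
# exactly the wasted memory and computation I'd like to avoid.
X_dense <- as.matrix(X)
```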

I've looked around and it seems like this sort of thing should be out there:

  • Chemists seem to run into this issue a lot (the paper I linked above was about learning to find new drug compounds), but the implementations I could find were either proprietary or highly specialized for chemical analysis. It's possible one of them could be repurposed, though.

  • Document classification also seems to be an area where learning from sparse feature spaces is useful (most documents don't contain most words). For instance, there's an oblique reference to a sparse implementation of C4.5 (a CART-like algorithm) in this paper, but no code.

  • According to the mailing list, WEKA can accept sparse data, but, unlike the method in the paper I linked above, it isn't optimized to exploit that sparseness and avoid wasted CPU cycles.

Thanks in advance!

Best Answer

I'd like to see a benchmark of their sparse implementation against the modern CART implementations used in random forests. That paper is quite old in terms of advances in this area, and I would be surprised if it still provided a significant speedup.

Part of the reason is that a clever sorting algorithm like Quicksort used in split searching can provide near-O(n) performance on near-constant features (including sparse ones). Fast implementations also track when a feature has become constant within a branch of a tree and stop examining it there. Dense feature representations provide fast, CPU-cache-friendly lookups, so you'd need a really clever sparse representation to win out in CPU cycles.

This is discussed here, here, and here.
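
As a toy illustration of the constant-feature tracking mentioned above (this is not any library's actual code; all names are invented), each node passes the set of still-active features down to its children and drops any feature that has become constant:

```r
# A toy sketch of constant-feature tracking; names are invented and this
# is not any particular library's code.
find_best_split <- function(X, node_rows, active_features) {
  still_active <- active_features
  for (j in active_features) {
    v <- X[node_rows, j]
    if (min(v) == max(v)) {
      # Feature j is constant within this node, so it can never yield a
      # split here or in any descendant node: drop it for the subtree.
      still_active <- setdiff(still_active, j)
      next
    }
    # ... evaluate candidate splits on the sorted values of v here;
    #     for a mostly-zero v, a 3-way quicksort groups the tied zeros
    #     in near-linear time, which is the near-O(n) behavior above ...
  }
  still_active  # children inherit the pruned feature set
}
```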

I actually implemented a sparse data representation at one point in my random-forest package CloudForest, but found it to be slower than a dense representation and abandoned it, though it did provide some memory advantages.

My recommendation would be to try the built-in boosting in scikit-learn or CloudForest and see if it is fast enough. Both can be extended with custom boosting criteria if you want to do something non-standard. (I originally wrote CloudForest to work with large, high-dimensional genetic data sets, which are very similar to what you are describing.)
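
As a quick sanity check before switching tools, you could time the dense code path you already have. Here's a rough sketch using the gbm package from your question; the data and parameter values are invented for illustration. If the dense fit is already fast enough on your real data, a sparse tree learner probably won't pay off:

```r
# A rough timing baseline with gbm on densified sparse data.
library(gbm)
library(Matrix)

set.seed(1)
X <- as.matrix(rsparsematrix(5000, 500, density = 0.001))  # densified sparse predictors
colnames(X) <- paste0("f", seq_len(ncol(X)))
y <- rbinom(nrow(X), 1, 0.05)                              # mostly-zero 0/1 response

elapsed <- system.time(
  fit <- gbm.fit(x = X, y = y,
                 distribution = "bernoulli",
                 n.trees = 100,
                 interaction.depth = 3,
                 shrinkage = 0.1,
                 verbose = FALSE)
)
print(elapsed)  # if the dense path is already fast enough, stop here
```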
