Solved – How to build a predictive model with a billion sparse features

dimensionality reduction, feature selection, logistic regression, sparse

I am building a model for a dataset with a very large number of features and sparse samples (I am planning to use logistic regression). The number of features can be as large as 1,000,000,000. The data are sparse in the sense that zeros vastly outnumber ones (perhaps one feature in a thousand is one and the rest are zero). To handle this dataset I think I need some dimensionality reduction, or the machine will not be able to fit the model, and I also want a method for dealing with the sparseness. So my questions are:

  1. How can I reduce the dimensionality?

  2. How can I deal with the sparseness?

Best Answer

An alternative to dimensionality reduction is to use the hashing trick to train a classifier on the entire feature set, with no reduction beforehand.* The Vowpal Wabbit project is an implementation of various learning algorithms that uses the hashing trick to speed up computation:

VW is the essence of speed in machine learning, able to learn from terafeature datasets with ease. Via parallel learning, it can exceed the throughput of any single machine network interface when doing linear learning, a first amongst learning algorithms.

I don't know if VW will end up being right for you (if you have billions of features, a lot of your choices may end up being dictated by software engineering considerations), but hopefully it's a pointer in the right direction!
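To make the idea concrete, here is a minimal sketch of the hashing trick using scikit-learn's FeatureHasher together with LogisticRegression, rather than VW itself. The feature names, the 2**20 hash-table size, and the toy labels below are all invented purely for illustration:

    from sklearn.feature_extraction import FeatureHasher
    from sklearn.linear_model import LogisticRegression

    # Each sample is just the list of names of its non-zero (binary) features,
    # so the full 10^9-dimensional vector is never materialized.
    samples = [
        ["feat_17", "feat_90210", "feat_123456789"],  # hypothetical feature IDs
        ["feat_42", "feat_90210"],
        ["feat_17", "feat_42"],
        ["feat_123456789"],
    ]
    labels = [1, 0, 1, 0]

    # Hash every feature name into a fixed 2**20-dimensional sparse vector.
    # Collisions are possible, but with a large enough table they are rare.
    hasher = FeatureHasher(n_features=2**20, input_type="string")
    X = hasher.transform(samples)  # scipy.sparse CSR matrix

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    print(clf.predict(hasher.transform([["feat_42", "feat_90210"]])))

The same principle is what lets VW scale: features are addressed by hashed IDs in a fixed-size weight table rather than through an explicit billion-column design matrix.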

* Well, the hashing trick is technically a kind of dimensionality reduction, but only in a very silly sense.
