I am building a model to learn from a dataset that has a very large number of features and sparse samples (I am planning to use logistic regression). The number of features can be as large as 1,000,000,000. The data is sparse in the sense that there are far more zeros than ones (perhaps one value in a thousand is a one and the rest are zeros). To handle this dataset I think I need some dimensionality reduction, or the machine will not be able to fit the model, and I also want a method for dealing with the sparseness. So my questions are:
- How do I reduce the dimensionality?
- How do I deal with the sparseness?
Best Answer
An alternative to dimensionality reduction is to use the hashing trick to train a classifier on the entire feature set without reduction beforehand.* The Vowpal Wabbit project is an implementation of various learning algorithms that uses the hashing trick to speed up computation.
I don't know whether VW will turn out to be right for you (with billions of features, many of your choices may end up being dictated by software engineering considerations), but hopefully it's a pointer in the right direction!
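To make the hashing trick concrete outside of VW, here is a minimal sketch in Python using scikit-learn's FeatureHasher together with SGDClassifier as a streaming logistic regression. The feature names, the 2**22 hash width, and the toy batches are all illustrative assumptions (and loss="log_loss" assumes a recent scikit-learn); this shows the shape of the approach, not a drop-in replacement for VW.

```python
# Sketch of hashed logistic regression with scikit-learn (not VW itself).
# Feature names, hash width, and batches below are illustrative assumptions.
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

# Hash raw feature names into a fixed-width sparse vector: memory depends on
# the number of hash buckets (2**22 here), not on the billion raw features.
hasher = FeatureHasher(n_features=2**22, input_type="string")
clf = SGDClassifier(loss="log_loss")  # logistic regression fit by SGD

# Pretend stream of samples: each sample is just its list of active
# (nonzero) feature names, so the zeros are never materialized.
batches = [
    (["f17", "f9013", "f4242"], 1),
    (["f88", "f17"], 0),
]
for names, label in batches:
    X = hasher.transform([names])            # 1 x 2**22 sparse matrix
    clf.partial_fit(X, [label], classes=[0, 1])

print(clf.predict(hasher.transform([["f17", "f4242"]])))
```

The key point is that memory scales with the number of hash buckets you choose rather than with the raw feature count, and hash collisions tend to do surprisingly little harm in practice.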
* Well, the hashing trick is technically a kind of dimensionality reduction, but only in a very silly sense.