Solved – First step for big data ($N = 10^{10}$, $p = 2000$)

data mining, large data, machine learning, r

Suppose you are analyzing a huge data set, to the tune of billions of observations per day, where each observation has a couple thousand sparse and possibly redundant numerical and categorical variables. Let's say there is one regression problem, one unbalanced binary classification problem, and one task of "find out which predictors are most important." My thought for how to approach the problem is:

Fit some predictive model on progressively larger (random) sub-samples of the data (see the R sketch after this outline) until:

  1. Fitting and cross-validating the model becomes computationally difficult (e.g., unreasonably slow on my laptop, R runs out of memory, etc.), OR

  2. The training and test RMSE or precision/recall values stabilize.

If I stop because fitting becomes computationally difficult before the errors stabilize (case 1), use a simpler model and/or implement multicore or multinode versions of the model and restart from the beginning.

If the training and test errors stabilized (case 2):

  • If $N_{\text{subset}} \ll N$ (i.e., I can still run algorithms on $X_{\text{subset}}$ as it's not too large yet), try to improve performance by expanding the feature space or using a more complex model and restarting from the beginning.

  • If $N_{\text{subset}}$ is 'large' and running further analyses is costly, analyze variable importance and end.
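A rough R sketch of that loop, under several placeholder assumptions: `read_rows()` stands in for whatever pulls rows off disk or out of a database (it is not a real function), the response is assumed to sit in a column named `y` in the first position, glmnet is just one possible model choice, and the doubling schedule and cutoffs are arbitrary.

```r
## Hypothetical sketch of the progressive-subsampling loop outlined above.
## read_rows() is a made-up loader; glmnet is one possible model choice.
library(glmnet)

set.seed(1)
N        <- 1e10     # total number of observations (never held in memory)
n        <- 1e5      # starting sub-sample size
prev_err <- NA

repeat {
  idx   <- sample(N, n)                      # random row indices
  dat   <- read_rows(idx)                    # hypothetical loader
  split <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.8, 0.2))
  train <- dat[split, ]
  test  <- dat[!split, ]

  ## Assumes column 1 of dat is the response y, the rest are predictors.
  fit  <- cv.glmnet(as.matrix(train[, -1]), train$y)
  pred <- predict(fit, as.matrix(test[, -1]), s = "lambda.min")
  err  <- sqrt(mean((test$y - pred)^2))      # test RMSE

  message(sprintf("n = %.0f, test RMSE = %.4f", n, err))

  ## Case 2: the error curve has flattened out.
  if (!is.na(prev_err) && abs(prev_err - err) / prev_err < 0.01) break
  ## Case 1: the next doubling would be computationally impractical.
  if (2 * n > 5e6) break

  prev_err <- err
  n        <- 2 * n                          # grow the sub-sample
}
```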

I plan to use R packages like biglm, speedglm, multicore, and ff initially, and later move to more complicated algorithms and/or multinode computing (on EC2) as necessary.
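For the biglm route, a chunked fit might look roughly like the following. The file name, column names, and chunk size are made up, but the `biglm()` / `update()` pattern is how the package is designed to fit a linear model in pieces:

```r
## Rough sketch: fit a linear model in chunks with biglm, so the full
## data set never has to sit in memory at once.
## "huge_data.csv", the formula, and the chunk size are placeholders.
library(biglm)

con        <- file("huge_data.csv", open = "r")
chunk_size <- 1e6

## Read the first chunk (with the header) and initialise the model.
chunk <- read.csv(con, nrows = chunk_size)
cols  <- names(chunk)
fit   <- biglm(y ~ x1 + x2 + x3, data = chunk)

## Stream the remaining chunks and fold each one into the fit.
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = cols),
    error = function(e) NULL            # EOF raises an error; stop there
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)
}
close(con)

summary(fit)   # coefficients estimated from the whole file
```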

Does this sound like a reasonable approach, and if so, do you have any specific advice or suggestions? If not, what would you try instead for a data set of this size?

Best Answer

You should look into online methods for regression and classification for datasets of this size. These approaches process observations one at a time (or in small batches), so you can use the whole dataset without ever having to load it into memory.
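To make the idea concrete, here is a toy online-learning sketch in R: logistic regression fitted by stochastic gradient descent, seeing each observation exactly once. The simulated stream, feature count, and learning rate are placeholders, not anything from the question.

```r
## Toy illustration of online learning: logistic regression via SGD.
## Each loop iteration stands in for a fresh batch arriving from disk.
set.seed(1)
p      <- 20
w      <- rep(0, p + 1)     # intercept + one weight per feature
lr     <- 0.05              # learning rate (placeholder value)
w_true <- rnorm(p + 1)      # "true" weights used to simulate the stream

sigmoid <- function(z) 1 / (1 + exp(-z))

for (batch in 1:200) {
  X <- cbind(1, matrix(rnorm(1000 * p), ncol = p))   # add intercept column
  y <- rbinom(1000, 1, sigmoid(X %*% w_true))
  for (i in seq_len(nrow(X))) {                      # one SGD step per row
    err <- sigmoid(sum(w * X[i, ])) - y[i]
    w   <- w - lr * err * X[i, ]
  }
}

cor(w, w_true)   # weights recovered without ever storing the stream
```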

You might also check out Vowpal Wabbit (VW):

https://github.com/JohnLangford/vowpal_wabbit/wiki

It uses an out-of-core online learning method, so it should be able to handle a dataset of this size. It supports regression and classification and accepts sparse input formats. You can also fit penalized versions (e.g., lasso-type regression/classification) in VW, which could improve your model's accuracy.
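Since the rest of the workflow is in R, one way to drive VW is to write the data out in VW's plain-text format (`label | feature:value ...`) and shell out to the command-line tool. The file names, features, and regularisation strength below are made up; the flags shown (`-d`, `--loss_function`, `--l1`, `-f`, `-t`, `-i`, `-p`) are standard VW options, but check the wiki for the exact invocation you need.

```r
## Rough sketch of calling VW from R: write VW-format lines, then train
## and predict with the command-line tool. All names/values are placeholders.
df <- data.frame(y = c(1, -1), x1 = c(0.5, 0.1), x2 = c(0, 2.3))

vw_lines <- apply(df, 1, function(r) {
  feats <- paste0(names(r)[-1], ":", r[-1], collapse = " ")
  paste0(r["y"], " | ", feats)        # e.g. "1 | x1:0.5 x2:0"
})
writeLines(vw_lines, "train.vw")

## Train an L1-penalised logistic model, then score the same file.
system("vw -d train.vw --loss_function logistic --l1 1e-6 -f model.vw")
system("vw -d train.vw -t -i model.vw -p preds.txt")
```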
