Solved – R regression and large data sets

Tags: large-data, r, references, regression

I would like to start learning how to do computations on large data sets. I have read lots of posts about things like MapReduce and Hadoop, but nothing that shows examples or even clearly explains how these things work; Hadoop still seems like a magic word people wave around. So suppose I want to do something like basic regression in R, but instead of a dozen variables and a few hundred data points, I want to do it with 800 variables and 1 million data points. How would I attempt such a problem? Can R still be used for something like this? Would I need Amazon's cloud? Are there any tutorials out there that walk you through this type of problem?

Best Answer

I would have a look at the CRAN High-Performance Computing Task View, which suggests the biglm package as a way to analyze big data sets that cannot fit in your computer's RAM. Alternatively, you can develop your algorithm on a subset of your data and then run the real calculation on the entire data set, for example on an Amazon EC2 instance with 32 GB of RAM.
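
To give a flavour of the biglm approach, here is a minimal sketch that fits a linear model in chunks so that only one chunk needs to be in memory at a time. The file name, column names, and formula are hypothetical; adjust them to your data. If you have categorical predictors, make sure every chunk contains all factor levels, or code them as dummy variables up front.

    library(biglm)

    chunk_size <- 100000                     # rows to hold in memory at once
    con <- file("bigdata.csv", open = "r")   # hypothetical 1-million-row file

    ## First chunk (with header) initialises the model; biglm keeps only
    ## compact summary statistics of the fit, not the raw rows.
    chunk <- read.csv(con, nrows = chunk_size)
    col_names <- names(chunk)
    fit <- biglm(y ~ x1 + x2 + x3, data = chunk)   # hypothetical formula

    ## Stream the remaining rows through the same fit, one chunk at a time.
    repeat {
      chunk <- tryCatch(
        read.csv(con, nrows = chunk_size, header = FALSE,
                 col.names = col_names),
        error = function(e) NULL             # read.csv errors at end of file
      )
      if (is.null(chunk) || nrow(chunk) == 0) break
      fit <- update(fit, chunk)              # update.biglm folds in the new chunk
    }
    close(con)

    summary(fit)                             # coefficients, standard errors, etc.

The same package also provides bigglm(), which uses a similar chunked interface and covers generalized linear models as well.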