Solved – R as an alternative to SAS for large data

Tags: large data, r, sas

I know that R is not particularly well suited to analysing large datasets, since R loads all the data into memory whereas something like SAS processes data sequentially. That said, there are packages like bigmemory that allow users to perform large-data (statistical) analysis more efficiently in R.

Apart from all the theoretical information, I wanted to know whether anyone has used, or is using, R to analyse large datasets in an enterprise environment, and what typical issues could arise. By large datasets I mean datasets of roughly 200 GB. Any thoughts on real-life examples of migrating from SAS to R in such use cases would also be helpful.

Best Answer

I have worked on very large data sets in R and have not had problems.

There are several approaches that work, but my basic paradigm is to find ways to process the data "sequentially". Obviously SAS has the same fundamental memory constraints if you're using it on the same machine; using R is just a little more DIY.
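As a rough illustration of that sequential idea, here is a minimal base R sketch that streams a CSV through an open connection and keeps only running totals in memory. The helper name `chunked_mean`, the column names, and the chunk size are all made up for the example, and it assumes a simple CSV with no embedded newlines inside fields.

```r
# Hypothetical helper: stream a CSV in fixed-size chunks and keep only
# running totals, so the whole file never has to fit in memory.
chunked_mean <- function(path, column, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # The first line holds the column names; later chunks have no header.
  col_names <- scan(text = readLines(con, n = 1L), what = "", sep = ",",
                    quiet = TRUE)

  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break          # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }
  total / n
}

# Small demonstration on a temporary file (illustrative data only).
tmp <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:25, value = as.numeric(1:25))
write.csv(demo, tmp, row.names = FALSE)
result <- chunked_mean(tmp, "value", chunk_size = 10L)
```

The same loop shape works for any statistic that can be updated incrementally (counts, sums, minima, cross-products); only the part that folds each chunk into the running state changes.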

In every case I have encountered, I'm either doing analysis on some kind of summary of the data, or doing analysis on chunks of the data and then summarizing the results. Either way, that's easy to accomplish in R.

It's pretty easy to create summaries if your data is structured in some way (really, in any way). Hadoop is a leading tool for creating summaries, but it's also easy to do batch processing on R data files, and if your data fits on your local storage device, batch processing it that way is also faster (in terms of both processing time and development time).
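A sketch of what such batch summarising might look like, assuming the data has already been split into one `.rds` file per batch (the directory layout, file names, and the `summarise_batch` helper are hypothetical). Each batch is reduced to per-group counts and sums, and the small partial results are then combined:

```r
# Assumption: one .rds file per batch; here a few small ones are generated
# in a temporary directory purely for demonstration.
data_dir <- file.path(tempdir(), "batches_demo")
dir.create(data_dir, showWarnings = FALSE)

set.seed(1)
for (i in 1:3) {
  batch <- data.frame(
    group = sample(c("a", "b"), 100, replace = TRUE),
    value = rnorm(100)
  )
  saveRDS(batch, file.path(data_dir, sprintf("batch_%02d.rds", i)))
}

# Hypothetical helper: reduce one batch to per-group counts and sums.
summarise_batch <- function(path) {
  batch <- readRDS(path)
  sums <- tapply(batch$value, batch$group, sum)
  counts <- tapply(batch$value, batch$group, length)
  data.frame(group = names(sums), n = as.vector(counts),
             total = as.vector(sums))
}

# Only one batch is held in memory at a time; the partial summaries are tiny.
files <- list.files(data_dir, pattern = "\\.rds$", full.names = TRUE)
partials <- do.call(rbind, lapply(files, summarise_batch))

# Combine the partial summaries into overall per-group means.
combined <- aggregate(cbind(n, total) ~ group, data = partials, FUN = sum)
combined$mean <- combined$total / combined$n
```

Keeping counts and sums rather than per-batch means is what makes the combination exact: averaging per-batch means would only be correct if every batch had the same number of rows per group.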

Batching your analysis by chunk is pretty easy too, using the same thought process.
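One way this can look for a concrete analysis, sketched under assumptions: an ordinary least-squares fit can be computed chunk by chunk by accumulating the cross-products X'X and X'y and solving once at the end. The simulated data, variable names, and chunk size below are illustrative only; with real data each chunk would be read from disk instead of sliced from an in-memory data frame.

```r
# Simulated data standing in for a table too large to load at once.
set.seed(42)
n <- 1000
full <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
full$y <- 1 + 2 * full$x1 - 0.5 * full$x2 + rnorm(n, sd = 0.1)

# Split row indices into chunks of 250 rows.
chunk_ids <- split(seq_len(n), ceiling(seq_len(n) / 250))

# Accumulate X'X and X'y across chunks; only these small matrices persist.
xtx <- matrix(0, 3, 3)
xty <- matrix(0, 3, 1)
for (rows in chunk_ids) {
  chunk <- full[rows, ]
  X <- model.matrix(~ x1 + x2, data = chunk)
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, chunk$y)
}

# Solve the normal equations once all chunks have been seen.
beta_chunked <- drop(solve(xtx, xty))

# For comparison only: the same model fitted on the full data in memory.
beta_full <- coef(lm(y ~ x1 + x2, data = full))
```

Solving the normal equations directly is less numerically stable than the QR decomposition that `lm` uses, so this is a sketch of the idea rather than a recommendation; the biglm package on CRAN is built for fitting linear models to data supplied in chunks.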

If you're really dying to do a linear model directly on a gigantic data set, then I think bigmemory is your answer, as suggested by Stéphane Laurent.

I don't really think there is one "answer" to "how do you deal with memory constraints" or "move to a new platform", but this is my long-winded two cents.