Solved – Out-of-core data analysis options

large-data, r, sas

I have been using SAS professionally for close to 5 years now. I have it installed on my laptop and frequently have to analyze datasets with 1,000-2,000 variables and hundreds of thousands of observations.

I have been looking for alternatives to SAS that let me conduct analyses on similarly sized data sets. I am curious what other people use for situations such as this. This certainly isn't "Big Data" in the sense the term is used today, nor are my datasets small enough to hold in memory. I need a solution that can apply algorithms to data stored on a hard drive. These are the things I have investigated to no avail:

  1. R – The bigmemory package can create matrices stored on disk rather than in memory, but all of the elements have to be the same mode, and I work with data that is almost a 50/50 split between character and numeric. The ff package gets closer to what I need, but I don't quite understand which procedures are compatible with it; I think support is somewhat limited. (A sketch of the ff approach follows this list.)
  2. Pandas – I was very excited about a Pythonic alternative to R. However, it too has to hold all of the data in memory.
  3. Revolution R – This one shows quite a bit of promise. I have a copy on my home computer (free if you sign up for Kaggle) but have yet to test it as a viable alternative to SAS. Comments on Revolution R as a SAS alternative are much appreciated.
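To make item 1 concrete, here is a minimal sketch of the ff workflow I am picturing. The file path, chunk size, and column name are placeholders, and I am not sure how far the ecosystem of ff-aware functions actually extends:

    # ff keeps the columns on disk in memory-mapped files; note that
    # character columns have to be stored as factors in an ffdf
    library(ff)
    library(ffbase)   # adds many base-R verbs (merge, table, etc.) for ffdf objects

    dat <- read.csv.ffdf(file = "big_file.csv",   # placeholder path
                         header = TRUE,
                         next.rows = 100000)      # read and append in chunks

    dim(dat)                           # metadata works on the on-disk object
    mean(dat$income[], na.rm = TRUE)   # [] pulls a single column into RAM

The open question for me is which modeling and summary functions accept an ffdf directly versus which ones quietly coerce it into an in-memory data frame.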

Thanks

UPDATE 1

Editing to add that I am looking for real-life, practical solutions that people have used successfully. For the most part, SAS lets me chug through big files without worrying one bit about memory constraints. However SAS is implemented, they figured out how to make memory management transparent to the user. But it is with a heavy heart that I use SAS for my job (I have to), and I would LOVE a FOSS alternative that lets me work on "large" data without having to think too hard about where the data is located at any given time (in memory or on disk).

The closest things I have come across are R's ff package and something on the horizon for Python called Blaze. And yet, these problems have existed for many years, so what have analysts been doing in the meantime? How are they handling these same issues with memory limits? The majority of solutions on offer seem to be:

  • Get more RAM — This isn't a good solution, imo. It's easy to find a dataset that exceeds RAM yet still fits on a hard drive. Furthermore, the workflow has to accommodate all of the structures that are created during exploratory data analysis.
  • Subset the data — This is fine for exploration but not for finalizing results and reporting. Eventually, whatever processes are developed on a subset will have to be applied to the entire dataset (in my case, anyway).
  • Chunk through the data — This is what I would like to know more about from people who actually implement this workflow. How is it done? With what tools? Can it be done in a way that's transparent to the user (i.e., create some on-disk data structure and the framework takes care of the chunking under the hood)? A rough sketch of the hand-rolled version I have in mind follows this list.
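To illustrate that third bullet, this is roughly the kind of manual chunking I imagine people doing today; the file path and the income column are made up, and what I am really after is tooling that hides this loop from the user:

    # stream a CSV in fixed-size chunks and accumulate a running mean,
    # so the full file never has to fit in RAM at once
    chunk_size <- 100000L
    con <- file("big_file.csv", open = "r")              # placeholder path
    header <- strsplit(readLines(con, n = 1), ",")[[1]]  # assumes simple, unquoted headers

    n_total   <- 0
    sum_total <- 0
    repeat {
      chunk <- tryCatch(
        read.csv(con, header = FALSE, nrows = chunk_size, col.names = header),
        error = function(e) NULL)   # read.csv errors once the connection is exhausted
      if (is.null(chunk) || nrow(chunk) == 0) break
      n_total   <- n_total + nrow(chunk)
      sum_total <- sum_total + sum(chunk$income, na.rm = TRUE)
    }
    close(con)
    sum_total / n_total   # mean income, computed one chunk at a time

Hand-rolling this for every summary statistic or model gets old fast, which is why I am hoping a framework already does it for me.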

Best Answer

If you're maxing out at 500,000 records x 2,000 variables, I would spend a little more money on RAM for your laptop and be done with it. If you have 16 GB, you can probably read the data set you're describing into R directly, and at that point you'll be able to do far more, and very quickly. But you say that's not an option, so:

Look at SQL-based packages for R. These allow you to connect to external databases and access those tables via SQL. Since SQL is pretty universal (and since R is open source), your code won't be lost if you change jobs or lose access to SAS. The easiest external database to set up is SQLite via RSQLite, but by far the fastest is MonetDB.R (speed tests). A minimal sketch of the RSQLite route is below.
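As a rough illustration of that workflow (the file path, table name, and column names are hypothetical), you load the file into an on-disk database once and then aggregate with SQL, so only the small result set ever comes back into R:

    library(DBI)
    library(RSQLite)

    db <- dbConnect(RSQLite::SQLite(), "analysis.db")   # on-disk database file

    # import the CSV into a table once (RSQLite can read a file path directly;
    # failing that, append it in chunks with dbWriteTable(..., append = TRUE))
    dbWriteTable(db, "survey", "big_file.csv", overwrite = TRUE)

    # the heavy lifting happens inside SQLite; only the summary returns to R
    res <- dbGetQuery(db, "
      SELECT region, COUNT(*) AS n, AVG(income) AS mean_income
      FROM survey
      GROUP BY region")

    dbDisconnect(db)

MonetDB.R speaks the same DBI interface, so switching to it mostly means changing the dbConnect() call while the analysis code stays put.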

There are probably a few good solutions to your stated problem; my guess is that just about all of them involve R ;)