Solved – Choosing statistical software based on the size of the dataset


The statistical software that I am familiar with, like Stata, is no longer capable of processing a 3 GB dataset with more than 1 million records, or anything bigger. Now I am working with JMP Pro and a little bit of R.

My slow personal laptop may be a factor too.

Could someone give me a summary of how to choose appropriate statistical software based on the size of the dataset?

Best Answer

I've got to nitpick at the term "big data". The definition varies wildly across people (and industries), but if your data set is less than 40 GB in size, I'd be apprehensive about calling it "big data". I'm Apache Spark dev certified, and the workloads that require tools like Hadoop/Spark usually involve distributed data sets in the range of 100 GB+. Big data usually already comes with big tools; you wouldn't be installing some piece of software to handle "big data" on your standalone machine.

Now let's focus on large files that span 1-60 GB (from personal experience). I'll call this medium-sized data. Data this size will most likely have been extracted from some SQL database and will already follow some structure. It is also far more common and relevant than the exaggerated talk of "big data".

As mentioned in the comments, I have a Dell Precision T7500 that allows me to comfortably work on data sets as large as 40 GB. But let's say your data (.csv, .txt, .json, etc.) is only 8 GB. Software like R, SAS, Python, and Stata will all work fine if your hardware has at least 10 GB of memory. Each has its performance issues, but if you have a repetitive ETL (extract, transform, load) workload on expanding data sets, SAS may work best. By industry standards, SAS is well known for handling large data sets well.

See page 8:

https://support.sas.com/resources/papers/Benchmark_R_Mahout_SAS.pdf

https://support.sas.com/resources/papers/Benchmark-LASR-IMSTAT.pdf

I work regularly with R and Python, and each has its benefits and costs. Both work with data in-memory, meaning that if the file is larger than your RAM, you will have to find another solution (e.g. a SQL database).
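For example, here is a minimal sketch in R of keeping the data in a database and pulling back only an aggregated result, using the DBI and RSQLite packages; the file name my_data.sqlite and the table/column names are placeholders:

    ## Keep the data on disk in SQLite and let the database do the aggregation,
    ## so only the small summary table ever has to fit in RAM.
    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), "my_data.sqlite")  # placeholder file
    smry <- dbGetQuery(con, "
      SELECT grp, COUNT(*) AS n, AVG(value) AS mean_value
      FROM records            -- placeholder table and columns
      GROUP BY grp
    ")
    dbDisconnect(con)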

R

Recently, Microsoft purchased Revolution Analytics, which gives MS a foothold in the open-source data science realm.

https://www.microsoft.com/en-us/cloud-platform/r-server

After purchasing Revolution Analytics, MS decided to embed R in SQL Server 2016. This gives users working with large data sets (but not exactly "big data") some good tools. You can download either SQL Server 2016 with R, or Microsoft R Open, which ships with Intel's MKL for parallel matrix computations (much of which OpenBLAS already offered). Parallelizing your matrix computations is useful if, say, you need to compute many linear regression models or one big linear model.
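To make that concrete, here is a rough sketch (with made-up sizes) of one big linear model whose heavy lifting is plain matrix algebra; crossprod() and solve() go through BLAS/LAPACK, which is what a multithreaded library like MKL or OpenBLAS parallelizes:

    ## Simulated design matrix and response; the sizes are only for illustration.
    set.seed(42)
    n <- 5e5
    p <- 50
    X <- matrix(rnorm(n * p), nrow = n)
    y <- X %*% rnorm(p) + rnorm(n)

    ## Solving the normal equations: crossprod() and solve() are BLAS/LAPACK
    ## calls, so they benefit directly from a parallel matrix library.
    beta_hat <- solve(crossprod(X), crossprod(X, y))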

Here is a presentation from the useR! 2016 conference on how R integrates with SQL Server for working with medium-sized data:

https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/Exploring-the-R--SQL-boundary

When working with large data sets, R is slow to read large files with its base functions. The readr and data.table packages allow for much faster reading of large files.

https://www.r-bloggers.com/importing-data-into-r-part-two/
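For example, a minimal sketch of reading the same file with readr and with data.table; the file name big_file.csv is a placeholder:

    library(readr)
    library(data.table)

    dat_tbl <- read_csv("big_file.csv")  # readr: returns a tibble
    dat_dt  <- fread("big_file.csv")     # data.table: returns a data.table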

Once you've read in the data, you'll most likely need to compute some descriptive statistics and do some data manipulation/cleaning. With that in mind, I suggest working with dplyr for better performance.
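A small, hypothetical example of grouped summaries with dplyr; the toy data frame just stands in for whatever you read in, and the column names are made up:

    library(dplyr)

    ## Toy data standing in for your real data set.
    dat <- data.frame(group = rep(c("a", "b"), each = 3),
                      value = c(1, 2, NA, 4, 5, 6))

    dat %>%
      filter(!is.na(value)) %>%
      group_by(group) %>%
      summarise(n_obs = n(),
                mean  = mean(value),
                sd    = sd(value))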

Python

With the help of the pandas library, Python is very useful for reading and manipulating data. I haven't formally recorded read times, but Python seems to read about as fast as readr in R.

http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table

AWS

Finally, if you have data files that are too large to read on your own machine, you can "borrow" a bigger one: turn to Amazon Web Services and use their EC2 instances for pennies per hour. With EC2, you can port your R/Python scripts and run whatever workload you couldn't on your standalone machine.
