How to make a representative sample set from a large overall dataset

sample-size, sampling, validation

What statistical techniques can be used to create a sample set that is representative of the entire population (with a known confidence level)?

Also,

  • How can I validate whether the sample fits the overall dataset?
  • Is this possible without parsing the entire dataset (which could be billions of records)?

Best Answer

If you don't wish to parse the entire data set then you probably can't use stratified sampling, so I'd suggest taking a large simple random sample. By taking a random sample, you ensure that the sample will, on average, be representative of the entire dataset. Standard statistical measures of precision, such as standard errors and confidence intervals, will tell you how far from the population values your sample estimates are likely to be. So there's no real need to validate that a sample is representative of the population unless you have some concern about whether it was truly sampled at random.
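As a minimal sketch of this idea in Python: if the total number of records is known and records can be fetched by position, you can sample indices at random and read only those records. The function `fetch_value` and the record count below are hypothetical stand-ins for a real data store, and the confidence interval uses a normal approximation.

```python
import math
import random
import statistics


def sample_indices(n_records, sample_size, seed=None):
    """Draw a simple random sample of record indices, without replacement."""
    rng = random.Random(seed)
    # range() is lazy, so this does not materialise all n_records indices.
    return rng.sample(range(n_records), sample_size)


def mean_with_ci(values, z=1.96):
    """Return (mean, standard error, lower bound, upper bound).

    Uses a normal approximation; z=1.96 gives a roughly 95% interval.
    """
    n = len(values)
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return mean, se, mean - z * se, mean + z * se


def fetch_value(i):
    """Hypothetical lookup: stands in for reading record i from storage."""
    return i % 100


N_RECORDS = 1_000_000_000  # assumed to be known in advance
idx = sample_indices(N_RECORDS, 10_000, seed=42)
values = [fetch_value(i) for i in idx]
mean, se, ci_low, ci_high = mean_with_ci(values)
print(f"mean={mean:.2f}, se={se:.3f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

Only the 10,000 sampled records are read here, not the full billion. The standard error shrinks in proportion to the square root of the sample size, which is why a larger sample gives more precise estimates.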

How large a simple random sample? Well, the larger the sample, the more precise your estimates will be. As you already have the data, conventional sample size calculations aren't really applicable, so you may as well use as much of your dataset as is practical for computing. Unless you're planning to do some complex analyses that will make computation time an issue, a simple approach would be to make the simple random sample as large as can be analysed on your PC without leading to paging or other memory issues. One rule of thumb is to limit the size of your dataset to no more than half your computer's RAM, so as to have space to manipulate it and leave space for the OS and maybe a couple of other smaller applications (such as an editor and a web browser). Another limitation is that 32-bit Windows operating systems by default won't allow the address space for any single application to be larger than $2^{31}$ bytes $\approx$ 2.1 GB, so if you're using 32-bit Windows, 1 GB may be a reasonable limit on the size of a dataset.

It's then a matter of some simple arithmetic to calculate how many observations you can sample given how many variables you have for each observation and how many bytes each variable takes up.
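That arithmetic can be sketched in a few lines of Python. The function name, the 8 GiB machine, and the 20 eight-byte variables are illustrative assumptions. Real in-memory formats usually add some per-row or per-object overhead, so treat the result as an upper bound.

```python
def max_sample_rows(budget_bytes, variable_sizes):
    """Number of observations that fit in budget_bytes.

    variable_sizes lists the size in bytes of each variable in one observation.
    """
    bytes_per_row = sum(variable_sizes)
    return budget_bytes // bytes_per_row


# Assumed example: 8 GiB of RAM, half of it reserved for the dataset,
# and 20 variables stored as 8-byte floats.
ram_bytes = 8 * 1024**3
budget = ram_bytes // 2
rows = max_sample_rows(budget, [8] * 20)
print(rows)  # 26843545

# Same data under a 1 GB cap, as suggested for 32-bit Windows.
rows_32bit = max_sample_rows(10**9, [8] * 20)
print(rows_32bit)  # 6250000
```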
