Solved – Random Sample from Power Law Distribution

distributions, power law, random-generation

I have a huge data set that is probably hundreds of millions of rows. This data follows a very skewed power law distribution. Consider the X-axis to be products and the Y-axis to be revenue from products. Almost 95% of revenue would come from 1% of the products. The distribution looks like this:

[Figure: data distributed in power law]

I want to generate a random sample from this distribution which approximates the original distribution.

All this data is in a huge Oracle DB. I see that Oracle SQL has DBMS_RANDOM.VALUE, which generates pseudo-random numbers.

These are my questions:

  1. What should my sample size be in order to come close to the original data set? Consider the original data set to be 100 million rows.
  2. Don't pseudo-random number generators follow a Gaussian distribution? If so, isn't it wrong to use a random-number generator fit for a Gaussian distribution on data that follows a power-law distribution?
  3. How should I do random sampling over power-law distributions? (This can be generalized to random sampling over any custom distribution.)

Best Answer

What should my sample size be in order to come close to the original data set? Consider the original data set to be 100 million rows.

  1. I'm not sure what you're asking. Suppose your data definitely follow a power law, and you know its parameters precisely. Then a random sample of any size from a power-law distribution with those parameters is, by definition, a set of random draws from the distribution.
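To make the point about samples concrete, here is a minimal sketch, assuming the rows fit in memory as a Python list. The `population` list is a hypothetical stand-in for the revenue column, filled with synthetic heavy-tailed values rather than real data, and the helper names are illustrative only.

```python
import random

def simple_random_sample(rows, k, seed=0):
    """Draw k rows uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical stand-in for the revenue column: synthetic heavy-tailed values.
_rng = random.Random(42)
population = [_rng.paretovariate(1.5) for _ in range(200_000)]

# Every row has the same chance of selection, regardless of its revenue.
sample = simple_random_sample(population, 20_000)
print(median(population), median(sample))
```

Robust summaries such as the median tend to agree closely between the sample and the full data. Tail-sensitive summaries, such as the share of revenue coming from the top 1% of products, are likely to need a larger sample before they settle down.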

Don't pseudo-random number generators follow a Gaussian distribution? If so, isn't it wrong to use a random-number generator fit for a Gaussian distribution on data that follows a power-law distribution?

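  1. No. Most PRNGs, including DBMS_RANDOM.VALUE, produce uniformly distributed values rather than Gaussian ones, and you can make a PRNG for arbitrary distributions by transforming those uniform values. Even if you don't have a prefab function for a particular distribution, there are many methods for generating exact and approximate deviates from arbitrary distributions.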
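One of those methods is inverse transform sampling: pass a uniform deviate through the inverse of the target distribution's CDF. The sketch below is only an illustration, using an exponential distribution because its inverse CDF has a simple closed form. The function names and the `rate` value are assumptions chosen for the example.

```python
import math
import random

def sample_by_inversion(inverse_cdf, n, seed=0):
    """Turn n uniform deviates on [0, 1) into draws from the target distribution."""
    rng = random.Random(seed)
    return [inverse_cdf(rng.random()) for _ in range(n)]

def exponential_inverse_cdf(u, rate=2.0):
    # CDF: F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -ln(1 - u) / rate.
    # u lies in [0, 1), so 1 - u is never zero and the log is defined.
    return -math.log(1.0 - u) / rate

draws = sample_by_inversion(exponential_inverse_cdf, 100_000)
print(sum(draws) / len(draws))  # sample mean; the theoretical mean is 1 / rate
```

The same pattern applies to any distribution whose inverse CDF can be computed or approximated numerically. When it cannot, rejection sampling is a common alternative.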

How should I do random sampling over power-law distributions? (This can be generalized to random sampling over any custom distribution.)

  1. I'd recommend starting with this article, "Power-Law Distributions in Empirical Data," by Aaron Clauset, Cosma Shalizi and M. E. J. Newman. It describes testing for power-law data and generating power-law deviates in gruesome detail, including discrete and continuous power laws, and several other variations and alternative models which are power-law-like.
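For the continuous case, the standard generator applies the inverse CDF to a uniform deviate. What follows is a minimal sketch under stated assumptions, not the paper's reference code: the density is taken to be proportional to x^(-alpha) for x >= xmin with alpha > 1, and the function names and parameter values are illustrative. The second helper is the usual maximum-likelihood estimate of alpha for a known xmin, included only to check the draws.

```python
import math
import random

def sample_power_law(alpha, xmin, n, seed=0):
    """Draw n deviates with density proportional to x**(-alpha) for x >= xmin."""
    rng = random.Random(seed)
    exponent = -1.0 / (alpha - 1.0)
    # Inverse CDF: x = xmin * (1 - u) ** (-1 / (alpha - 1)), with u uniform on [0, 1).
    return [xmin * (1.0 - rng.random()) ** exponent for _ in range(n)]

def estimate_alpha(xs, xmin):
    """Maximum-likelihood estimate of alpha, treating xmin as known."""
    return 1.0 + len(xs) / sum(math.log(x / xmin) for x in xs)

xs = sample_power_law(alpha=2.5, xmin=1.0, n=100_000)
print(estimate_alpha(xs, xmin=1.0))  # should land close to the alpha used above
```

Discrete power laws and the choice of xmin for real data need more care, which is where the article's detail becomes useful.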