First question. You are right about being able to use software instead of tables
of the chi-squared distribution. For example, if df = 9 and the
chi-squared statistic is 20.16, you could look at a chi-squared
table to see that $20.16 > 19.02,$ where 19.02 cuts area 0.025
from the upper tail of $Chisq(df = 9)$.
You would reject at the 2.5% level.
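For reference, R can reproduce that tabled critical value via qchisq, the inverse CDF of the chi-squared distribution:

qchisq(0.975, 9)   # cuts area 0.025 from the upper tail; about 19.02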
If you wanted a P-value, you could use software
to find the probability of the chi-squared statistic being
greater than 20.16. In R this is computed as follows, where pchisq
is the CDF of a chi-squared distribution:
1 - pchisq(20.16, 9)   # P(X > 20.16) for X ~ Chisq(df = 9)
## 0.01695026
Thus the P-value (probability of a value more extreme than 20.16)
is about 0.017. Some software will give you the P-value automatically.
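Equivalently, pchisq accepts a lower.tail argument that returns the
upper-tail probability directly, avoiding the subtraction:

pchisq(20.16, 9, lower.tail=FALSE)   # same value, about 0.017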
Second question. As far as binning is concerned, you are right that in some
instances there are alternative ways of binning. You do not
want so many bins that the expected count in any bin falls below
about 5; otherwise the chi-squared statistic is not well approximated
by the chi-squared distribution. Given that restriction,
it is usually better to use more bins rather than fewer.
Also notice
that the df of the chi-squared distribution depends directly on
the number of $bins$ used, not on the overall number of $events$ counted.
(I do not understand what you say about 'approximately Gaussian'
in this context.)
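For instance, here is a minimal sketch (with simulated data, not your
problem's data) that bins 100 standard normal observations into 6
equal-probability bins, so each expected count is 100/6, comfortably
above 5; the test then has df = 6 - 1 = 5 regardless of the sample size:

x = rnorm(100)                    # simulated standard normal sample
brks = qnorm(seq(0, 1, by=1/6))   # bin edges; each bin has probability 1/6
chisq.test(table(cut(x, brks)))   # expected counts 100/6 each; df = 5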
Examples: Here is an example in which we simulate 60 rolls of a fair die, so that we expect 10 instances of each face. The observed numbers
of each face are tabulated. Finally, a chi-squared test of the hypothesis
that the die is fair gives a goodness-of-fit statistic of 3.0
and a P-value of about 70% (consistent with a fair die).
face = sample(1:6, 60, replace=TRUE)  # simulate 60 rolls of a fair die
table(face)
## face
## 1 2 3 4 5 6
## 9 6 12 10 10 13
chisq.test(table(face))
## Chi-squared test for given probabilities # default is equal probabilities
## data: table(face)
## X-squared = 3, df = 5, p-value = 0.7
In the test, the default is that faces have equal probabilities
unless some other probability vector is specified. The test procedure
chisq.test
finds the P-value as follows (and rounds):
1 - pchisq(3, 5)
## 0.6999858
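You can also reproduce the statistic itself from the tabulated counts,
since each expected count is 10:

obs = c(9, 6, 12, 10, 10, 13)   # observed counts from the table above
sum((obs - 10)^2 / 10)          # X^2 = sum of (O - E)^2 / E
## 3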
In our second example, we simulate 600 rolls of a die that
is heavily biased in favor of faces 4, 5, and 6 (see prob
vector). Here
the null hypothesis that the die is fair is soundly rejected
with an extremely small P-value.
face = sample(1:6, 600, replace=TRUE, prob=c(1,1,1,2,2,2)/9)  # biased toward faces 4, 5, 6
table(face)
## face
## 1 2 3 4 5 6
## 59 67 80 123 135 136
chisq.test(table(face))
## Chi-squared test for given probabilities # default is test for 'fair' die
## data: table(face)
## X-squared = 62.2, df = 5, p-value = 4.263e-12
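The same hand computation confirms the statistic, now with expected
counts of 100 per face under the null hypothesis:

obs = c(59, 67, 80, 123, 135, 136)   # observed counts from the table above
sum((obs - 100)^2 / 100)             # (O - E)^2 / E summed over the six faces
## 62.2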
Neither of the offered answers improves on the main bottleneck, which is the $O(n\log n)$ sorting running time (for an array of size $n$).
One can use a clever selection algorithm, which only partially sorts the data, to avoid the full sort and end up with a linear-time computation.
Most of the time this is used to compute the median (the 50th percentile), but it works for arbitrary order statistics. Because it makes heavy use of the main partitioning routine behind QuickSort, it is called QuickSelect. See the Wikipedia article on selection algorithms for a full discussion of the field.
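In R, one way to sketch this idea is sort with its partial argument,
which only guarantees that the requested order statistic lands in the
right position rather than sorting everything (the rank convention
ceiling(0.9 n) below is just one of several percentile definitions):

x = runif(1e6)                  # simulated data
k = ceiling(0.90 * length(x))   # rank of the 90th percentile
sort(x, partial=k)[k]           # k-th smallest value, without a full sort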
Best Answer
Once you have an approximation to the $90$th percentile, for example, you can make one pass through the data and find the exact value. Set a range that you are sure the actual $90$th percentile lies in. Read the data, counting the values that fall below your range, retaining the values in your range, and counting the values above your range. The better your approximation, the narrower the range can be and the fewer values you need to save. Then using the counts, you can find the rank in your range that is the exact $90$th percentile.
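Here is a minimal sketch of that counting pass, assuming a bracket [lo, hi] believed to contain the target (the function name exact_p90 and the bracket endpoints are hypothetical, and ceiling(0.9 n) is again just one rank convention):

exact_p90 = function(x, lo, hi) {
  below = sum(x < lo)                 # count of values below the bracket
  kept  = x[x >= lo & x <= hi]        # retain only the in-bracket values
  k     = ceiling(0.90 * length(x))   # overall rank of the 90th percentile
  stopifnot(k > below, k <= below + length(kept))  # bracket must cover the target
  sort(kept)[k - below]               # its rank among the retained values
}

The narrower the bracket, the smaller kept is, so storage shrinks as the initial approximation improves.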