Solved – Is calculating a percentile the same as evaluating a cumulative density function

distributionsquantiles

I'm trying to make the jump from the idea of a percentile, say, over the real number line (where the nth percentile is simply the position in which n% of data points are below it, and 100-n% are above it), to the idea of the area under a probability density function.

If I want to know the 50% percentile from a set of numbers, I'll find the point in which half the numbers are below, half the numbers are above. That's the 50% percentile, and I'm done.

If I want to know the 50% percentile from a distribution, say, a Z-score, I'll evaluate the cdf from 0 – 50, and I'm done. Am I saying this correct?

This feel right intuitively, but I need some discussion to hammer it home. Or, I could be completely off…

Best Answer

You are close but not exactly right. Remember that the area under a probability distribution has to sum to 1. The cumulative density function (CDF) is a function with values in [0,1] since CDF is defined as $$ F(a) = \int_{-\infty}^{a} f(x) dx $$ where f(x) is the probability density function. Then 50th percentile is the total probability of 50% of the samples which means the point where CDF reaches 0.5. Or in more general terms, the p'th percentile is the point where the CDF reaches p/100.

Related Solutions

Solved – Are there techniques to merge two cumulative distribution functions

If I understand you correctly, your question can be answered by using a data structure to estimate quantiles - one for each server instance, and that these N data structures has the ability to be 'merged' (or 'reduced') with the rest to get aggregate estimates of the entire cluster.

I believe the TDigest data structure should be able to help you: https://github.com/tdunning/t-digest

Here is the description:

A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications.

This way, you can calculate estimates for each server, AND, also combine these aggregated estimates for the entire cluster, without processing ALL the raw data for the entire cluster again.

More details on the underlying techniques can be followed from the link. I believe it is based on the q-digest algorithm.

Solved – Using bootstrap to estimate the 95th percentile and confidence interval for skewed data

Maybe you don't have to adopt normal distribution. Why don't you just use the 2.5% percentile and 97.5% percentile of boot strap sample percentiles as the confidence interval? I simulated usual bootstrap method and it seems work when comparing to the method using binomial distribution.

I don't have your data so I made some data from gamma distribution which is skewed.

#making data
set.seed(1)
x<-rgamma(7000,5,0.3)
hist(x)

x_sorted<-sort(x)

x_sorted[round(7000*0.95)] # estimate of 95% x

This is the bootstrap code I ran.

#method1 bootstrap
bootx95p(x_sorted,1000,0.05)

bootx95p<-function(x,b,alpha){
# x is your data
# b is the number of bootstraps.
# alpha is your type I error

n<-length(x)
p<-round(n*0.95)
xp<-rep(0,b)

for(i in 1:b){
x_boot<-sample(x,n,replace=TRUE)
x_boot<-sort(x_boot)
xp[i]<-x_boot[p]
}

xp<-sort(xp)
a<-round(alpha/2*(b+1))
CIup<-xp[b-a+1]
CIlow<-xp[a]

cat(' CI (',CIlow,', ',CIup,')','\n')
hist(xp)

}

The estimate would be 30.56664 and this is the result of the bootstrap method : CI ( 30.0623 , 31.08694 ) The below is the histogram of the distribution of 95th percentile of sample percentiles acquired from bootstrap method.

And this is the method you also suggested using binomial distribution.

 #2 using bionomial
    up=qbinom(0.975,7000,0.95)
    low=qbinom(0.025,7000,0.95)
    x_sorted[up]
    x_sorted[low]

The result is quite similar :

> x_sorted[up]
[1] 31.08901
> x_sorted[low]
[1] 30.04189

As someone may have noticed from my English, I am not a native English speaker and even learning English. So It would be appreciated if someone correct my grammar.

Best Answer

Related Solutions

Solved – Are there techniques to merge two cumulative distribution functions

Solved – Using bootstrap to estimate the 95th percentile and confidence interval for skewed data

Related Question