[Math] Calculate Percentile of Skewed Dataset

normal distribution, statistics

I am looking to calculate the 90th, 95th and 99th percentiles of a dataset.
Usually the data are approximately normally distributed.
So I can use z-values of 1.282, 1.645 and 2.326 to approximate the percentiles as follows: X = μ + z*σ

Now what if the dataset is skewed?
How do I approximate the percentiles of the dataset then?
Do I take the natural log of the data to smooth out the skew?

Is there anything I can do with the calculated skewness to get a better mean and standard deviation for my dataset, so that I can get a more accurate percentile approximation?

Thanks.

Best Answer

The straightforward way is to use the definition of percentile (which differs a bit from text to text and software to software) and count observations. This works for data from any distribution. (Differences in definitions do not matter much for large samples.) Roughly speaking, the 90th percentile is a value below which one finds not more than 90% of the observations, and above which one finds not more than 10% of them.
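As a minimal sketch of the counting approach, the hypothetical helper `pct_by_count` below sorts the data and returns the smallest observation that has at least a fraction `p` of the sample at or below it. This is only one of the definitions in use (it should agree with `type = 1` in R's `quantile`). The helper name, the small tolerance, and the vector `y` are made up for illustration.

```r
# Hypothetical helper: percentile by sorting and counting
# (inverse of the empirical CDF; one of several definitions in use).
pct_by_count <- function(x, p) {
  xs <- sort(x)
  n  <- length(xs)
  k  <- ceiling(p * n - 1e-8)   # smallest rank covering a fraction p; tolerance guards against rounding
  xs[max(k, 1)]
}

y <- c(12, 3, 7, 25, 9, 18, 4, 30, 15, 21)   # made-up illustrative data

pct_by_count(y, 0.90)          # counting definition
quantile(y, 0.90, type = 1)    # the same definition as built into R
quantile(y, 0.90)              # R's default (type 7) interpolates between order statistics
```

The last two calls can return different values for the same data, which is the kind of difference between definitions mentioned above. It shrinks as the sample grows.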

Here are examples of percentiles for two datasets in R, one normal (symmetrical) and one exponential (right-skewed). Notice that percentiles of small samples do not necessarily match percentiles of the populations from which they are sampled. (The method you have been using for normal data seems to conflate the two kinds of percentiles.) In the data displays below, the numbers in brackets give the index of the first observation in that row.

x = round(sort(rnorm(50, 100, 15)), 1)  # generate 50 obs from Norm mean=100, SD=15
x
 [1]  61.1  69.4  71.1  73.0  73.9  77.5  78.0  78.0  79.0  81.5
[11]  83.4  85.9  86.5  87.8  87.9  88.0  88.8  90.0  90.7  91.3
[21]  92.0  93.0  95.3  97.9  97.9  99.2  99.2 100.0 100.2 101.0
[31] 102.9 103.4 104.3 104.6 105.4 107.2 108.5 108.6 109.5 109.6
[41] 111.3 111.5 111.9 118.0 119.5 119.6 119.6 119.9 121.5 128.4
quantile(x, .9)  # 90th percentile
   90% 
119.51 
quantile(x, .7)  # 70th percentile
   70% 
105.94 
qnorm(c(.9, .7), 100, 15)  # 90th and 70th percentiles of POPULATION
[1] 119.2233 107.8660
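
To see where the sample value 119.51 comes from, here is a rough sketch of the interpolation that R's `quantile` uses by default (its `type = 7` rule), applied by hand to the 45th and 46th sorted values in the listing above. The variable names are illustrative only.

```r
# Type 7 rule: fractional position h = (n - 1) * p + 1 in the sorted sample,
# then linear interpolation between the two neighbouring order statistics.
n <- 50
p <- 0.9
h <- (n - 1) * p + 1          # 45.1, i.e. one tenth of the way from x[45] to x[46]

x45 <- 119.5                  # 45th sorted value from the listing
x46 <- 119.6                  # 46th sorted value from the listing

q90 <- x45 + (h - floor(h)) * (x46 - x45)
q90                           # 119.51, matching quantile(x, .9)
```

No distributional assumption enters this calculation, so the same steps apply to skewed data.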

x = round(sort(rexp(60, rate=1/50)), 1)  # generate 60 obs from EXP mean=50
x
 [1]   0.0   0.4   0.5   0.9   0.9   1.3   2.4   3.8   4.4   6.1
[11]   7.5   7.9   8.0   9.0   9.9  11.0  11.5  13.3  15.4  16.4
[21]  19.6  20.3  25.1  25.4  28.0  28.8  29.3  29.5  31.1  32.0
[31]  32.1  34.2  37.2  40.6  42.0  42.0  49.5  52.2  55.6  56.8
[41]  59.6  64.3  73.9  74.7  78.7  87.9  90.1  95.4  97.2 105.2
[51] 110.3 113.8 114.6 172.5 187.0 188.2 188.8 207.9 259.7 265.3
quantile(x, .9)
   90% 
173.95 
quantile(x, .7)
  70% 
67.18 
qexp(c(.9, .7), rate=1/50)  # 90th and 70th percentiles of POPULATION
[1] 115.12925  60.19864
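
As a rough illustration of what happens when the normal formula X = μ + z*σ is applied to a skewed population, the sketch below compares it with the exact percentiles of the exponential population used above (mean 50, for which the SD is also 50). The variable names are illustrative, and the comparison is at the population level, so no sampling error is involved.

```r
# Exponential population with mean 50 (rate = 1/50); its SD is also 50.
p     <- c(0.90, 0.95, 0.99)
mu    <- 50
sigma <- 50

normal_formula <- mu + qnorm(p) * sigma    # X = mu + z*sigma, z from the standard normal
true_pct       <- qexp(p, rate = 1/50)     # exact percentiles of the exponential

round(rbind(normal_formula, true_pct), 1)
```

Under these assumptions the normal formula gives roughly 114, 132 and 166, against exact values of about 115, 150 and 230. The 90th percentile happens to come out close, but the shortfall grows further into the right tail, which is why counting observations (or using the quantile function of a better-fitting distribution) is safer for skewed data.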