[Math] Calculating percentile value from mean and standard deviation of a normal distribution

normal distribution, numerical methods, percentile, statistics

I have to write some code to calculate the 95th percentile of a dataset which is normally distributed. I can easily calculate the mean and the standard deviation, which define the distribution. However, from those two values alone, is it possible to determine the x value of the 95th percentile? If so, could someone help me with the mathematical formula, which I will then convert into code?

Best Answer

To answer the question from your title "Calculating percentile value from mean and standard deviation of a normal distribution":

In practice you can do that (i.e. compute the normal cumulative distribution function $\Phi$) by converting the raw value to a Z-score (subtract the mean, then divide by the standard deviation) and then using a lookup table (sometimes called a Z-table) to convert the Z-score to a probability; multiply that probability by 100 to get the percentile. Wikipedia has both the tables and examples of how to use them.
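As a rough sketch of the Z-score route in code, the lookup table can be replaced by the standard identity $\Phi(z) = \tfrac12\left(1 + \operatorname{erf}(z/\sqrt2)\right)$ and Python's built-in `math.erf`. The function name `normal_percentile` is just an illustrative choice, not something from a library.

```python
import math

def normal_percentile(x, mean, std_dev):
    """Percentile (0..100) of the raw value x under Normal(mean, std_dev)."""
    # Step 1: convert the raw value to a Z-score.
    z = (x - mean) / std_dev
    # Step 2: evaluate the standard normal CDF through the error function.
    prob = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # Step 3: a percentile is the probability times 100.
    return 100.0 * prob
```

For instance, `normal_percentile(115, 100, 15)` is the percentile of a value one standard deviation above the mean, which comes out at roughly 84.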

If one needs more precision than a lookup table would provide there are some numerical algorithms that can compute that. The one in R's pnorm is based on

  • Cody, W. D. (1993) Algorithm 715: SPECFUN – A portable FORTRAN package of special function routines and test drivers. ACM Transactions on Mathematical Software 19, 22–32.

Numerous other methods rely on the simple transformation from $\Phi$ to the error function (erf), for which one can find many approximations. The paper by Soranzo and Epure (2014, also used for the probit further down) gives an approximation formula directly, for $x \ge 0$, as $$ \Phi(x) \approx 2^{-22^{1-41^{x/10}}} $$

Or, more legibly, in code notation: 2**(-22**(1-41**(x/10))). Note that the formula is only accurate for non-negative $x$; for negative arguments, use the symmetry $\Phi(-x) = 1-\Phi(x)$, which preserves the low error.
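A minimal Python sketch of this approximation, with the symmetry applied for negative arguments, might look like the following. The name `phi_approx` is made up for illustration.

```python
def phi_approx(x):
    """Approximate the standard normal CDF with the Soranzo-Epure formula."""
    if x < 0:
        # The formula is only accurate for x >= 0, so reflect negative
        # arguments using Phi(-x) = 1 - Phi(x).
        return 1.0 - phi_approx(-x)
    return 2.0 ** (-(22.0 ** (1.0 - 41.0 ** (x / 10.0))))
```

At `x = 0` the expression reduces to exactly 0.5, and at `x = 1.645` it gives a value close to 0.95, in line with a Z-table.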


In the body of your question you are asking the opposite problem: "is it possible to determine the x value of the 95th percentile?" That's possible too. In general this is called the inverse cumulative distribution function or, more succinctly, the quantile function, but for the normal distribution that function is just called probit, so that's the shortest word-like name for $\Phi^{-1}$. In R probit is implemented in qnorm. The numerical implementation of that in R is based on

Besides that, the probit has a simple algebraic formula that relates it to the inverse error function. And there are some approximation formulas for the latter as well, e.g.

$$\operatorname{erf}^{-1}(x) \approx \operatorname{sgn}(x) \sqrt{ \sqrt{\left(\frac{2}{\pi a} + \frac{\ln(1 - x^2)}{2}\right)^2 - \frac{\ln(1 - x^2)}{a}} - \left(\frac{2}{\pi a} + \frac{\ln(1 - x^2)}{2}\right) }. $$ where

$$ a = \frac{8(\pi - 3)}{3\pi(4 - \pi)} \approx 0.140012.$$

Then:

$$\operatorname{probit}(p) = \sqrt{2}\,\operatorname{erf}^{-1}(2p-1).$$

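If it needs spelling out, probit gives you the z-score from the probability $p$ (the percentile divided by 100). To convert the z-score to your "x", apply the inverse of the z-score transformation: multiply by the standard deviation and then add the mean, i.e. $x = \text{mean} + \text{std-dev} \cdot \operatorname{probit}(p)$. For the 95th percentile, $\operatorname{probit}(0.95) \approx 1.645$, so $x \approx \text{mean} + 1.645 \cdot \text{std-dev}$.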
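Putting the pieces above together, here is a hedged Python sketch: the inverse error function approximation, the probit built on it, and the final conversion back to an x value. All function names are illustrative, and the result is only as accurate as the approximation (roughly two to three significant digits).

```python
import math

# Constant a from the approximation formula above (about 0.140012).
A = 8.0 * (math.pi - 3.0) / (3.0 * math.pi * (4.0 - math.pi))

def erfinv_approx(x):
    """Approximate inverse error function for -1 < x < 1."""
    ln_term = math.log(1.0 - x * x)
    t = 2.0 / (math.pi * A) + ln_term / 2.0
    # max(..., 0.0) guards against a tiny negative rounding error near x = 0.
    inner = max(math.sqrt(t * t - ln_term / A) - t, 0.0)
    return math.copysign(math.sqrt(inner), x)

def probit_approx(p):
    """Approximate z-score for a probability 0 < p < 1."""
    return math.sqrt(2.0) * erfinv_approx(2.0 * p - 1.0)

def percentile_value(p, mean, std_dev):
    """x value at probability p for Normal(mean, std_dev)."""
    # Undo the z-score transformation: scale by std-dev, then shift by the mean.
    return mean + std_dev * probit_approx(p)
```

For the question as asked, `percentile_value(0.95, mean, std_dev)` would return the approximate 95th percentile. Note that `p` must lie strictly between 0 and 1, since the logarithm is undefined at the endpoints.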

If you don't care much about accuracy, you can go old school and approximate the probit by logit, e.g. compute it as

$$\operatorname{probit}(p) \approx \sqrt{\frac{\pi}{8}}\, \ln\left( \frac{p}{1-p} \right).$$

The latter approximation gets pretty bad as $p$ approaches 0 or 1 (i.e. it's best around 0.5). For example, at $p = 0.95$ it gives about 1.85, whereas the true value is about 1.645.
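To see how the logit approximation behaves, one can compare it against a reference quantile function. The sketch below uses `statistics.NormalDist().inv_cdf` from the Python standard library (available since Python 3.8) as the reference; the function name `probit_logit_approx` is made up for this example.

```python
import math
from statistics import NormalDist

def probit_logit_approx(p):
    """Crude probit approximation via a rescaled logit, for 0 < p < 1."""
    return math.sqrt(math.pi / 8.0) * math.log(p / (1.0 - p))

def logit_error(p):
    """Absolute error of the logit approximation against the library probit."""
    return abs(probit_logit_approx(p) - NormalDist().inv_cdf(p))
```

Evaluating `logit_error` at a few probabilities shows the error staying small near `p = 0.5` and growing towards the tails, so this shortcut is a poor fit for something like a 95th percentile.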

Another good approximation for probit from a recent paper by Soranzo and Epure (2014) is

$$\operatorname{probit}(p) \approx \frac{10}{\ln 41}\, \ln \left(1- \frac{\ln \frac{-\ln p}{\ln 2}}{\ln 22} \right) $$

This has low error for $p \ge 0.5$, but one can use the symmetry $ \operatorname{probit}(1-p) = -\operatorname{probit}(p) $ for $p$ below 0.5.
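As a sketch of how this last formula might be turned into code for the original problem, the following Python applies the symmetry for $p < 0.5$ and then rescales by the mean and standard deviation. The names `probit_se` and `percentile_from_mean_sd` are hypothetical.

```python
import math

def probit_se(p):
    """Approximate probit using the Soranzo-Epure formula, for 0 < p < 1."""
    if p < 0.5:
        # The formula is accurate for p >= 0.5, so use
        # probit(1 - p) = -probit(p) for the lower half.
        return -probit_se(1.0 - p)
    inner = math.log(-math.log(p) / math.log(2.0)) / math.log(22.0)
    return 10.0 / math.log(41.0) * math.log(1.0 - inner)

def percentile_from_mean_sd(percentile, mean, std_dev):
    """x value of the given percentile (0..100, exclusive) of Normal(mean, std_dev)."""
    z = probit_se(percentile / 100.0)
    return mean + std_dev * z
```

With this, `percentile_from_mean_sd(95, mean, std_dev)` lands close to `mean + 1.645 * std_dev`.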