Solved – Approximation for Beta distribution when alpha is less than 10

beta distribution

I know that we can approximate Beta distribution to Normal distribution when the values of alpha and beta are large numbers.

In my problem alpha lies between 1 and 10, beta is always greater than 1000.
Is there any distribution (for such a combination of alpha and beta) that we can approximate it to, with a simpler closed form so that it will be easy to calculate CDF and InverseCDF?

(I have tried Kumaraswamy distribution but did not get good results due to the large values of beta.)

EDIT
"easy"- in terms of computation time in JAVA/PHP.

(I need an approximate distribution since it takes lot of time to compute the InverseCDF of Beta distribution(0.05 ms in java is also very slow for my needs).

For alpha greater than 10 i approximated Beta distribution to Normal distribution and computed InverseCDF like this which included just math. So i did not use any other library functions like Normal distribution or Beta distribution, which reduced the processing time drastically.)

Best Answer

Johnson and Kotz point out that

  1. The $B$ (Beta) and $F$ distributions are closely related: When $k$ has a $B(\nu_1, \nu_2)$ distribution, then $f$ has an $F(2\nu_1, 2\nu_2)$ distribution where

    $$f = \frac{k \nu_2}{(1-k)\nu_1}.$$

  2. The Wilson-Hilferty approximation to $\chi^2$ (or, equivalently, a $\Gamma$ distribution)--which has "remarkable accuracy"--can be applied to $F$, which itself is a ratio of $\chi^2$ distributions.

The result is that

$$z = \frac{\sqrt[3]{f} \left(1-\frac{2}{9 \nu_2}\right)+\left(\frac{2}{9 \nu_1}-1 \right)}{\sqrt{\frac{2 f^{2/3}}{9 \nu_2}+\frac{2}{9 \nu_1}}}$$

has approximately a standard Normal distribution. This provides relatively fast calculations of the CDF of either the $B$ or the $F$ distribution and is not difficult to invert.

This approximation works fairly well for $\nu_2 \ge 1000$ and $\nu_1=1$ and progressively better as $\nu_1$ increases. Here is a set of plots of the difference between the approximate CDF and the correct CDF for $\nu_2=1000$ and $\nu_1 = 1,2,3,6,10$, shown with identical scales on their axes:

CDF error plots

The error is extremely low in the right tail, where interest often focuses, but even in the left tail it rarely exceeds $0.002$ once $\nu_1 \ge 2$. As $\nu_2$ increases, the errors remain about the same.


Given the level of accuracy of this approximation, it makes little sense to compute the standard Normal CDF (or its inverse) to double-precision accuracy. This will allow for even faster computation.

One approach is to create a table of values and interpolate within it. Because the Normal CDF is so curved near the tails, it would seem a largish table might be needed for high accuracy. However, by considering Mills Ratio one might be led instead to create a (very) small table of pairs $(z, \sqrt{-2\log(\Phi(z))}$ for $z$ in the lower tail, say for $-8 \le z \le 0$, where $\Phi$ is the standard Normal CDF. Cubic spline interpolation does a great job even when these pairs are evaluated only for integral $z$ (nine values total in the entire table). To further simplify the programming (if a cubic spline interpolation routine is not readily available) one could implement a quadratic spline. Even better--and barely more difficult--is to form the weighted average of a linear interpolator (which tends to underestimate the CDF) and a quadratic interpolator (which overestimates the CDF). A weight of $0.238$ applied to the linear interpolator works well:

Comparison with normal CDF

This plot shows the differences between three interplators and the standard Normal CDF: the red one is for the linear interpolator, the gold for the quadratic interpolator, and the blue (in between) for their weighted average.

(The same technique of averaging in a linear interpolator also decreases the error by an order of magnitude when using a cubic spline.)

The weighted average is another quadratic interpolator but it is no longer a spline (it fails to have continuous first derivatives at the negative integers)--but so what? In effect, it computes the CDF with absolute accuracy better than $0.00005$ for $z\le -1$ and, by symmetry, for $z\ge 1$. Quadratic or cubic interpolation between a small number of pairs of $(z, \Phi(z))$ should work fine for $-1 \lt z \lt 1$ if such calculations are needed. (Typically only very low accuracy, say to about $0.01$, is needed in this range anyway.) The upshot is that the CDF (or, using similar methods, its inverse) can be computed to sufficient accuracy using about a dozen simple arithmetic operations and one exponentiation, at a cost of storing around a dozen precomputed values: even on an interpreted platform like Java that should go very quickly, taking a small fraction of a microsecond.


Examples

Mathematica code to produce the first figure follows.

fDist[f_, {ν1_, ν2_}] := CDF[NormalDistribution[]][((1-2/(9 ν2)) f^(1/3) - (1-2/(9 ν1))) /
    Sqrt[2 f^(2/3)/(9 ν2) + 2/(9 ν1)]];
f[k_, {ν1_, ν2_}] := (k ν2)/((1 - k) ν1);
With[{ν1 = 1, ν2 = 1000},
 Plot[fDist[f[k,{ν1,ν2}], {2 ν1,2 ν2}] - CDF[FRatioDistribution[2 ν1,2 ν2]][f[k,{ν1,ν2}]], 
    {k, 0, 10 ν1/(ν1 + ν2)}, PlotRange -> Full]
 ]

Code to perform the interpolation for $|z| \ge 1$ and plot its error:

d = Table[{z, Sqrt[-2 Log[CDF[NormalDistribution[], z]]]}, {z, -8, -0, 1}];
f1 = Interpolation[d, InterpolationOrder -> 1];
f2 = Interpolation[d, InterpolationOrder -> 2];
f[z_] := 0.238 f1[z] + (1 - 0.238) f2[z];
Plot[{Exp[-f[z]^2/2 ] - CDF[NormalDistribution[], z]}, {z, -4, -1}, PlotRange -> Full]

References

Norman L. Johnson and Samuel Kotz, Continuous Univariate Distributions - 2. J. Wiley & Sons, New York, 1970.