How much faster and how accurate is this approximation for $\binom{n}{k}$ when $n \gg k$?

algorithms, approximation, binomial-coefficients, computational-complexity, computer-science

I'm coding an algorithm that makes heavy use of $\binom{n}{k}$ for $n \gg k$ (for example $n = 10^{10}$, $k = 1000$). I only need rough approximations of $\binom{n}{k}$, but the computation needs to be fast. So I applied this approximation:
$$\binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{n(n-1)\cdots(n-k+1)}{k!} \stackrel{n \text{ is large}}{\approx} \frac{n^k}{k!} \stackrel{\text{Stirling}}{\approx} \frac{n^k}{\sqrt{2\pi k}\left(\frac{k}{e}\right)^k} = \sqrt{\frac{1}{2\pi k}}\left(\frac{ne}{k}\right)^k$$

I implemented $\binom{n}{k}$ exactly (in two ways) and the approximation in Python 3.7:

from operator import mul
from fractions import Fraction
import functools
import math
from decimal import Decimal

def binom(n, k):
    # true division / converts the huge integer quotient to a float
    return math.factorial(n) / (math.factorial(k) * math.factorial(n - k))

def comb(n, k):
    # exact: multiplies the k fractions n/1, (n-1)/2, ..., (n-k+1)/k
    return int(functools.reduce(mul, (Fraction(n - i, i + 1) for i in range(k)), 1))

def approx_comb(n, k):
    # sqrt(1/(2*pi*k)) * (n*e/k)**k, evaluated in Decimal arithmetic
    n = Decimal(n)
    k = Decimal(k)
    return int((n * Decimal(math.exp(1)) / k)**k * Decimal(math.sqrt(1 / (2 * Decimal(math.pi) * k))))

The binom function essentially never returned a result for inputs of this size: it always raised OverflowError: integer division result too large for a float, because the true division / tries to convert the astronomically large integer quotient to a float.
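(As an aside, a minimal sketch just to show where the overflow comes from — the name binom_int is mine — is to use floor division, which stays in exact integer arithmetic. Note that it still computes the gigantic $n!$, so it is hopelessly slow for $n$ of this size:)

def binom_int(n, k):
    # floor division keeps everything in arbitrary-precision integers, so no
    # OverflowError; it is exact because k!(n-k)! divides n! evenly
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))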

The comb function multiplies the fractions $\frac{n}{1}, \frac{n-1}{2}, \dots, \frac{n-k+1}{k}$, which was much more efficient:

%%time
comb(100000000,1000)

>> Wall time: 24.4 ms
>> 24727856381885447097491872465571346139452385321184242788899766723126597918273665
69637235850783343618972942790141736611652393840424422491889743195814202183294476495
34475997640077231761898939979394571033582633059292894746931865217877366183292362...

And my approx_comb function returned an approximation in about a tenth of the time:

%%time
approx_comb(100000000,1000)

>> Wall time: 1.95 ms
>> 24853752492343170386331401240000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000...

(Note: Decimal arithmetic rounds results to $28$ significant digits under the default context, which is why the trailing digits above are all zeros.)
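(If you need more digits, the default context's precision can be raised; a minimal sketch, with the choice of $50$ digits being arbitrary:)

from decimal import Decimal, getcontext

getcontext().prec = 50  # subsequent Decimal arithmetic keeps 50 significant digits
print(Decimal(1) / Decimal(7))
# 0.14285714285714285714285714285714285714285714285714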

These results were very close to each other; even with $k = 100\,000$, the approximate value is only about $1.005$ times the true value:

approx_comb(100000000,100000)/comb(100000000,100000)

>> 1.0050912666473568

My questions are:

Let's say $k$ is small, so $k \le 1000$, and $n$ is always very large, say $n \ge 1\,000\,000$.

  • How close will this approximation be? For example, will it ever leave the bounds $\frac{1}{2}\binom{n}{k} \le \text{approximation} \le 2\binom{n}{k}$?
  • My approximation always seemed to be about $10$ times faster to compute. How much faster will it be for larger $n$? How can I express the cost of these calculations in big-$\mathcal{O}$ notation?

Best Answer

Big-O notation is a bit tricky for calculations like this, because we would need to know details such as how the computer multiplies Fraction objects or performs the ** operation in order to say how much more time those operations take when they are given very large numbers.

For the sake of comparing the functions, however, you might assume for simplicity (if you don't have better information) that operations like ** and math.sqrt take constant time. Under that assumption, your approx_comb function is $\mathcal O(1)$ and your comb function is $\mathcal O(k)$. If comb only takes about $10$ times as long as approx_comb when $k=1000,$ rather than $1000$ times as long, you might conclude that the constant factor in the running time of approx_comb is $100$ times as large as for comb.
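Concretely, from the timings in the question at $k = 1000$: comb took $24.4$ ms, i.e. about $24\ \mu\text{s}$ per factor of the product, while approx_comb took $1.95$ ms, roughly $80$ times that per-factor cost — which is the sense in which approx_comb's constant factor is on the order of $100$ times larger.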

But given all the uncertain details inside the two functions, I would say the best way to estimate their big-O performance is to run some examples with different numbers and see how the times scale. For example, does comb really scale linearly with $k$, and does it really not matter whether $n$ is $10000$ or $1000000000$?
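For example, something along these lines (a sketch; comb is the exact function from the question, and the grid of test values is arbitrary):

import timeit

# time comb for growing k at fixed n ...
for k in (125, 250, 500, 1000, 2000):
    t = timeit.timeit(lambda: comb(10**9, k), number=5) / 5
    print(f"comb(10**9, {k}): {t:.4f} s")

# ... and for growing n at fixed k
for n in (10**4, 10**6, 10**8, 10**10):
    t = timeit.timeit(lambda: comb(n, 1000), number=5) / 5
    print(f"comb({n}, 1000): {t:.4f} s")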

Update: The simple assumption is definitely incorrect. Perhaps it is possible to control precision in a way that avoids long running times, but when $n = 1000000,$ the running times of the approximations as written are very sensitive to $k$ when $k > 100$. I did not look at enough data points to estimate the asymptotic time, but it seems clearly worse than $\mathcal O(k)$. For $k = 1000$ the running time is also somewhat sensitive to $n.$

In favor of the approximations, the comb function is also clearly worse than $\mathcal O(k)$.


Regarding accuracy, provided the individual operations don't suffer some kind of overflow error, approx_comb will always give a result larger than the true value for $k > 1,$ since then

$$ n^k > n(n-1)\cdots(n - k + 1). $$

You might want to take advantage of the fact that for $k > 1,$

$$ n(n-k+1) < \left(n - \frac{k - 1}2\right)^2 < n^2 $$

and similarly

$$ (n-j)(n-k+j+1) < \left(n - \frac{k - 1}2\right)^2 < n^2 $$

for $0 < j < k - j - 1.$ In other words, you can take the terms of $n(n-1)\cdots(n - k + 1)$ in pairs from both ends of the expression, working from the outside toward the middle, and the product of each pair is less than $\left(n - \frac{k - 1}2\right)^2$. If $k$ is even this accounts for all the terms $n(n-1)\cdots(n - k + 1)$, but if $k$ is odd you have a leftover term exactly equal to $n - \frac{k - 1}2$. In either case, you have that

$$ n(n-1)\cdots(n - k + 1) < \left(n - \frac{k - 1}2\right)^k < n^k, $$

so you can improve your approximation by substituting $n - \frac{k - 1}2$ for $n$ in your formula.
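As a quick sanity check (my own numbers, not from the original question), take $n = 10$, $k = 4$:

$$ 10 \cdot 9 \cdot 8 \cdot 7 = 5040 \;<\; \left(10 - \tfrac{4-1}{2}\right)^4 = 8.5^4 \approx 5220 \;<\; 10^4 = 10000. $$

The midpoint substitution overshoots by under $4\%$, while the crude $n^k$ bound is nearly twice the true value; this substitution is exactly what approx_comb_b below implements.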

By the way, int rounds downward rather than rounding to the nearest integer. Normally I would say use round or add $0.5$ to the result before calling int, but in this case the approximation is always greater than the true answer, which is an integer, so rounding down is appropriate.
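For instance (a quick illustration; $2.99$ is an arbitrary example):

>>> int(2.99)
2
>>> round(2.99)
3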


You might also want to look at https://en.wikipedia.org/wiki/Binomial_coefficient#n_much_larger_than_k, which gives the approximation

$$ \binom nk \approx \exp\left( \left(n + \tfrac12\right) \ln\left(\frac{n + \tfrac12}{n - k + \tfrac12}\right) + k \ln\left(\frac{n - k + \tfrac12}k\right) - \frac{\ln(2\pi k)}2 \right). $$

In this case I'm not sure that rounding down is correct, so I would round to nearest. (This formula is what approx_comb_c below implements.)


For reference, here are some python functions I tested:

from operator import mul
from fractions import Fraction
import functools
import math
from decimal import Decimal
import timeit

def comb(n, k):
    # exact value, as in the question
    return int(functools.reduce(mul, (Fraction(n - i, i + 1) for i in range(k)), 1))

def approx_comb_a(n, k):
    # the question's approximation: (n*e/k)**k / sqrt(2*pi*k)
    n = Decimal(n)
    k = Decimal(k)
    base = n * Decimal(math.exp(1)) / k
    term2 = 2 * Decimal(math.pi) * k
    return int(base**k / term2.sqrt())

def approx_comb_b(n, k):
    # same, but with n replaced by the midpoint n - (k-1)/2
    n = Decimal(n)
    k = Decimal(k)
    base = (n - Decimal(0.5) * (k - 1)) * Decimal(math.exp(1)) / k
    term2 = 2 * Decimal(math.pi) * k
    return int(base**k / term2.sqrt())

def approx_comb_c(n, k):
    # the formula from the Wikipedia article quoted above
    # (note: n + 0.5 is computed in floating point before the Decimal conversion)
    n1 = Decimal(n + 0.5)
    k = Decimal(k)
    nk = n1 - k
    base1 = n1 / nk
    base2 = nk / k
    term3 = 2 * Decimal(math.pi) * k
    return int(base1**n1 * base2**k / term3.sqrt())

And here are some results:

>>> approx_comb_a(1000000,1000)/comb(1000000,1000)
1.6483088671229085
>>> approx_comb_b(1000000,1000)/comb(1000000,1000)
1.0001250494328289
>>> approx_comb_c(1000000,1000)/comb(1000000,1000)
1.0000833367611621

As you can see, all approximations are within a factor of $2$ of the correct result, but the simple approximation using $n^k$ has a $64.8\%$ error whereas the approximation using $(n - (k - 1)/2)^k$ has only about a $0.0125\%$ error, and the error for the third approximation is about $\frac23$ of that. Running times were not much different among the three approximations.
