Expected number of non-uniform draws until collision

expected valuemultinomial-distributionprobability

Edit May 9 — high-level summary of the issue here. $R$ gives a good proxy for estimating collision time, with a slight undercount. Random matrices and graphs give distributions with longer time until collision than what you'd expect by looking at their values of $R$

Suppose I do IID draws from multinomial distribution with probabilities $p_1,\ldots,p_n$. Is there a nice approximation for the $x(p)$, the expected number of draws from $p$ until collision, ie, drawing some $i$ more than once?

In the case of $p_1=p_2=\dots=p_n$, this was shown to be the following

$$x(p)=\sqrt{\frac{\pi n}{2}}$$


From simulations, the following appears to be a lower bound

$$x(p)\approx \sqrt{\frac{\pi R}{2}}$$

where $R=\frac{1}{\|p\|^2}$ and $\|p\|^2=p_1^2+\ldots+p_n^2$ is the probability of observing a collision in two IID draws from $p$.

enter image description here

Distributions were taken to be of the form $p_i\propto i^{-c}$ with $c$ varying

Best Answer

An approximation

Let $X_i$ ($i=1,2 \cdots $) be the result of $i-$th draw, let $C_{i,j}\equiv X_i = X_j$ ($i<j)$ be the colission event for the pair $(i,j)$. Let the event $D_k = \cap_{i<j\le k} C_{i,j}$ correspond to having all first $k$ draws different (no collisions); let $d_k = P(D_k)$. Clearly, $d_1=1$ , $d_{n+1}=0$.

Let $F_k$ be the event of having our first collision at draw $k$, and let $f_k=P(F_k)$

Then $$\begin{align} f_k &=\sum_{i=1}^{k-1} P(X_k = X_i\cap D_{k-1}) \\ &= P(D_{k-1})\sum_{i=1}^{k-1} P(X_k = X_i | D_{k-1}) \\ &= d_{k-1} \, (k-1) P(X_k = X_1 | D_{k-1}) \\ \tag{1} & \approx d_{k-1} \, (k-1) P(X_k = X_1 ) \\ &= d_{k-1} \, (k-1) \, \alpha \\ \end{align}$$

where $\alpha = \sum_{m=1}^n p_m^2$. We have $f_1=0$ and $f_2 = \alpha$.

Also: $$\begin{align} d_k &= d_{k-1} -f_k \\ \tag{2} \end{align}$$

This system is solved by

$$ d_k = \prod_{j=1}^{k-1} (1- j \alpha) \\ \tag{3}$$

(Actually, we need to truncate the product so that $d_k \ge 0$)

Finally, the expected time for next colission would be $$x(p) = 2 + \sum_{k=2}^n d_k = 2 + \sum_{k=2}^n \prod_{j=1}^{\min(k-1,1/\alpha)} (1- j \alpha)$$

Update: This approach does not seem promising. Actually $P(X_k = X_1 | D_{k-1}) \approx \alpha$ is only valid near $k=2$ ; at the other extreme $P(X_{n+1} = X_1 | D_{n}) = 1/n$.

Alternatively, one might resort to Newton's identities, noticing that our $d_k$ correspond to $ k! e_k$ in that page. This allows to compute $d_k$ (and hence $x(p)$) numerically, from $\alpha$ and the other power sums (which correspond to $p_k$ there) - as in JimB's answer. But, again, this does not seem to provide much insight about the asymptotics or approximations.


Added: another way to get the numerical exact solution, probably useless (numerically unstable), using Poissonization and numerical depoissonization.

Matlab/Octave code:

n = 20 

c = 1  % constant for p(i) = i^(-c)
pp = [1:n].^(-c); % probabilities
pp = pp/sum(pp);  % normalized

POI = zeros(n+1); % Poisson transform matrix
for i = 0:n
 for j = 0:n
  POI(i+1,j+1) = exp(-i)*(i)^(j)/factorial(j);
 endfor
endfor

% prob of no collision, for the Poisson model
gk=zeros(n+1,1); 
for k=0:n
   gk(k+1) = exp(-k) * prod(1+k*pp);
endfor

% prob of no collision, for the multinomial (exact) model
dk = linsolve(POI,gk) ;

%expected time
sum(dk)

This gives $x(p) = 4.6333757$