Edit: The indicator random variables have been radically changed, so that one gets a very quick computation of the mean and variance.
Let $Y$ be the number of reds drawn before the first blue. Suppose that the red balls have labels $1, 2, 3,\dots,r$. Let $X_i=1$ if red ball with label $i$ is drawn before the first blue is drawn, and let $X_i=0$ otherwise.
Then $Y=X_1+\cdots+X_r$. Note that the number of draws up to and including the first blue is $Y+1$. But $Y+1$ and $Y$ have the same variance.
To calculate the variance of $Y$, we first calculate its mean. By linearity of expectation we have
$$E(Y)=E(X_1)+E(X_2)+\cdots+E(X_r).$$
By symmetry, all the $E(X_i)$ are the same. The probability red with label $i$ comes before any of the $b$ blue is $\frac{1}{b+1}$. It follows that $E(Y)=\frac{r}{b+1}$.
To calculate the variance of $Y$, calculate $E[(X_1+\cdots +X_{r})^2]$ and subtract the square of $E(Y)$, which we know.
To find $E[(X_1+\cdots+X_r)^2]$, expand the square and use the linearity of expectation. We know the expectation of $\sum X_i^2$, since $X_i^2=X_i$. So we need the expectations of the cross terms.
For $i\ne j$, $X_iX_j=1$ if both red ball $i$ and red ball $j$ come before any blue. This has probability $\frac{2}{(b+2)(b+1)}$. Multiply by $2\binom{r}{2}$ to get the sum of the cross terms.
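Carrying this computation to its conclusion (a short completion of the step just described):
$$E(Y^2)=\frac{r}{b+1}+2\binom{r}{2}\frac{2}{(b+1)(b+2)}
=\frac{r}{b+1}+\frac{2r(r-1)}{(b+1)(b+2)},$$
so
$$\operatorname{Var}(Y)=\frac{r}{b+1}+\frac{2r(r-1)}{(b+1)(b+2)}-\frac{r^2}{(b+1)^2}
=\frac{rb\,(r+b+1)}{(b+1)^2(b+2)},$$
which matches the known variance of the negative hypergeometric distribution.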
We use the notation from the following MSE
link with $m$ for
the number of trials and $n$ for the number of different types of
coupons. We treat the special case where there are $j$ instances of
each type and we are sampling without replacement.
We ask about the probability of obtaining the distribution
$$\prod_{q=1}^n C_q^{\alpha_q}$$
where $\alpha_q$ says we have that many instances of type $C_q.$
We obtain
$$\frac{(nj-\sum_{q=1}^n \alpha_q)!}{(nj)!}
\prod_{q=1}^n \frac{j!}{(j-\alpha_q)!}.$$
Therefore the number of sequences of length $m$ of $n$
types of coupons drawn without replacement, with a maximum of $j$ coupons of
each type, is given by
$$m! [z^m]
\left(\sum_{k=0}^j \frac{j!}{(j-k)!}\frac{z^k}{k!}\right)^{n}
= m! [z^m] (1+z)^{nj} = {nj\choose m} \times m!.$$
We then obtain from first principles the formula
$$P[T=m] = \frac{1}{m!} {nj\choose m}^{-1} \times
n \times j \times (m-1)! [z^{m-1}]
\left(\sum_{k=1}^j \frac{j!}{(j-k)!}\frac{z^k}{k!}\right)^{n-1}
\\ = nj \times \frac{1}{m} {nj\choose m}^{-1}
[z^{m-1}] \left(-1 + (1+z)^j\right)^{n-1}.$$
This becomes
$$nj \times \frac{1}{m} {nj\choose m}^{-1}
[z^{m-1}] \sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q} (1+z)^{qj}
\\ = {nj-1\choose m-1}^{-1}
\sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q} {qj\choose m-1}.$$
Observe that
$${qj\choose m-1} {nj-1\choose m-1}^{-1}
= \frac{(qj)! \times (nj-1-(m-1))!}{(qj-(m-1))! \times (nj-1)!}
\\ = {nj-1\choose qj}^{-1} {nj-1-(m-1)\choose qj-(m-1)}.$$
We record for the probabilities the formula
$$\bbox[5px,border:2px solid #00A000]{ P[T=m] =
\sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q}
{nj-1\choose qj}^{-1} {nj-1-(m-1)\choose nj-1-qj}.}$$
Start by verifying that this is a probability distribution. We obtain
for the sum in $m$
$$\sum_{m=n}^{(n-1)j+1} {nj-1-(m-1)\choose nj-1-qj}
\\ = [z^{nj-1-qj}] \sum_{m=n}^{(n-1)j+1} (1+z)^{nj-1-(m-1)}
\\ = [z^{nj-1-qj}] \sum_{m=j-1}^{n(j-1)} (1+z)^m
= [z^{nj-qj}] ((1+z)^{n(j-1)+1} - (1+z)^{j-1}).$$
We have $nj-qj\ge j$ so only the first term contributes and we obtain
$$\sum_m P[T=m] =
\sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q} {nj-1\choose qj}^{-1}
{n(j-1)+1\choose nj-qj}
\\ = \frac{n(j-1)+1}{j}
\sum_{q=0}^{n-1} {n-1\choose q} \frac{(-1)^{n-1-q}}{n-q}
{nj-1\choose nj-qj-1}^{-1} {n(j-1)\choose nj-qj-1}$$
We get for the rightmost pair of binomial coefficients
$$\frac{(n(j-1))! \times (qj)!}{(nj-1)! \times (qj+1-n)!}
= {nj-1\choose n-1}^{-1} {qj\choose n-1}$$
which yields for the sum
$$\frac{n(j-1)+1}{j} {nj-1\choose n-1}^{-1}
\sum_{q=0}^{n-1} {n-1\choose q} \frac{(-1)^{n-1-q}}{n-q} {qj\choose n-1}
\\ = \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1}
\sum_{q=0}^{n-1} {n\choose q} (-1)^{n-1-q} {qj\choose n-1}
\\ = \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1} {nj\choose n-1}
\\ + \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1}
\sum_{q=0}^{n} {n\choose q} (-1)^{n-1-q} {qj\choose n-1}
\\ = \frac{n(j-1)+1}{nj}\frac{nj}{nj+1-n}
\\ + \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1}
[z^{n-1}] \sum_{q=0}^{n} {n\choose q} (-1)^{n-1-q} (1+z)^{qj}
\\ = 1 - \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1}
[z^{n-1}] (1-(1+z)^j)^n$$
Now observe that $[z^{n-1}] (1-(1+z)^j)^n = 0$, since the lowest-order
term of $1-(1+z)^j$ is $-jz$ and hence the power series starts at $z^n$.
Therefore everything simplifies to $1$ and we have a probability distribution.
Continuing with the expectation we have the following closed form:
$$\bbox[5px,border:2px solid #00A000]{ \mathrm{E}[T] =
\sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q}
{nj-1\choose qj}^{-1}
\sum_{m=n}^{(n-1)j+1} m {nj-1-(m-1)\choose nj-1-qj}.}$$
Let us examine some of these values computed symbolically. Here are
the first few for eight types of coupons starting at $j=1:$
$$8,{\frac {76627}{6435}},{\frac {76801}{5434}},{\frac {7473667}{480675}},
{\frac {1318429}{79794}},\ldots$$
and here is the initial segment for ten types of coupons:
$$10,{\frac {707825}{46189}},{\frac {7008811}{380380}},
{\frac {266299459}{13042315}},{\frac {182251913}{8360638}},
{\frac {748880445829}{32831263465}},\ldots$$
Careful inspection of these values reveals that we cannot hope for
additional simplification when $j\ge 2$ because if it were possible it
would have appeared in these sample values. We do see however that the
case $j=1$ should be possible, the value being $n$ (we always finish
after $n$ draws if there is only one instance of each coupon).
We now do this calculation, which is trivial, but nonetheless a useful
sanity check, starting with
$$\sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q}
{n-1\choose q}^{-1}
\sum_{m=n}^{n} m {n-1-(m-1)\choose n-1-q}
\\ = \sum_{q=0}^{n-1} (-1)^{n-1-q}
\times n {n-1-(n-1)\choose n-1-q}
\\ = (-1)^{n-1-(n-1)} \times n \times {0\choose n-1-(n-1)} = n.$$
It certainly seems like a worthwhile challenge to prove that the
closed form for $\mathrm{E}[T]$ tends to $n H_n$ in the limit
$j\to\infty$ (the with-replacement coupon collector value), which is
confirmed by the numerical evidence.
We did verify the formula for the expectation in software, as follows.
It really is quite remarkable that the output from this program is in
excellent agreement with the closed form on all values that were
tested.
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <time.h>
#include <string.h>

int main(int argc, char **argv)
{
    int n = 6, j = 3, trials = 1000;

    if(argc >= 2){
        n = atoi(argv[1]);
    }
    if(argc >= 3){
        j = atoi(argv[2]);
    }
    if(argc >= 4){
        trials = atoi(argv[3]);
    }

    assert(1 <= n);
    assert(1 <= j);
    assert(1 <= trials);

    srand48(time(NULL));

    long long data = 0;
    for(int tind = 0; tind < trials; tind++){
        /* pool of n*j coupons, j instances of each of n types */
        int src[n*j];
        for(int cind = 0; cind < n*j; cind++)
            src[cind] = cind/j;

        int seen = 0; int steps = 0;
        int dist[n];
        for(int cind = 0; cind < n; cind++)
            dist[cind] = 0;

        /* draw without replacement until all n types have been seen */
        while(seen < n){
            int cpidx = drand48() * (double)(n*j-steps);
            int coupon = src[cpidx];
            /* remove the drawn coupon by shifting the tail left */
            for(int cind=cpidx; cind < n*j-steps-1; cind++)
                src[cind] = src[cind+1];
            steps++;
            if(dist[coupon] == 0)
                seen++;
            dist[coupon]++;
        }
        data += steps;
    }

    long double expt = (long double)data/(long double)trials;
    printf("[n = %d, j = %d, trials = %d]: %Le\n",
           n, j, trials, expt);
    exit(0);
}
Best Answer
Mistakes in your analysis
To make things clear, let $N$ be the number of draws it takes to see all colors, so $E[N]$ is your quantity of interest.
There are two things wrong with this:
First, a unit mismatch. For $D_m$, the number of draws is fixed, and the number of distinct colors seen is random. This is not the same as having the number of colors be fixed and the number of draws to reach that target being random. Therefore, there is no direct relation between $D_m$ and $N$.
Your definition of $D_m$ still fits in a "without-replacement" paradigm. You could say that it is approximately with-replacement.
This calculation assumes drawing with replacement, which is only approximately correct. Approximations should always be notated with $\approx$.
The calculation still computes the approximation incorrectly. The probability that a color is present after $m$ draws with replacement would be $$ 1-(1-\tfrac1{\color{red}{n}})^m \approx 1-e^{-m/n} $$ Then, linearity of expectation would imply $$ E[D_m]\approx n(1-e^{-m/n}) $$ The probability without replacement is $$ 1-\binom{(n-1)k}{m}\Big/\binom{nk}{m}, $$ so the exact expression is $$ E[D_m]=n\left(1-\binom{(n-1)k}{m}\Big/\binom{nk}{m}\right) $$
The problem here is your language is imprecise. What does "in expectation we will draw every ball to see each type" mean? In order to find $E[N]$, you will need a different method.
For each $i\in \{1,\dots,n\}$, let $A_{i,m}$ be the event that you do not see color number $i$ in the first $m$ draws. By symmetry, $$ P(A_{1,m}\cap A_{2,m}\cap\cdots \cap A_{j,m})=\binom{(n-j)k}{m}\Big/\binom {nk}m $$
Now, letting $B_m$ be the event that some color is missing after $m$ draws, we can use the above result, along with the principle of inclusion exclusion, to conclude $$ P(B_m)=\sum_{j=1}^n(-1)^{j-1}\binom{n}{j}\binom{(n-j)k}{m}\Big/\binom {nk}m $$ Warning: For this formula to work, we need to use the following convention: $\binom{a}{b}$ is defined to be zero when $b>a$ or $b$ is negative. This means we do not have to worry about summation bounds.
Finally, $$ E[N]=\sum_{m=0}^{\infty}P(N>m)=\sum_{m=0}^{nk} P(B_m)= \sum_{m=0}^{nk}\sum_{j=1}^n(-1)^{j-1}\binom{n}{j}\binom{(n-j)k}{m}\Big/\binom {nk}m $$