How to represent the sampling distribution of a random variable which has probability $\rho$ of being present

binomial distribution, expected value, probability, sampling

Say we have a continuous random variable $S\in [0,1]$ which has an unknown probability distribution $p_s$. Now suppose we run $N$ trials, where in each trial we draw from $S$ with probability $\rho$ and get $0$ with probability $1-\rho$. We want the expected value (and distribution) of the resulting total, with the expected value expressed just in terms of $E[S]$.

In other words, in each trial we have probability $\rho$ of getting something, and that something is a random variable $S$ distributed according to $p_s$. The rest of the time we get nothing at all.

We can model the number of trials which return something as a standard binomial distribution. That is, out of $N$ trials the number of times we draw from $S$ is modeled by $B(N, \rho)$ where $B$ is the standard binomial distribution.

If $S$ were always 1, we would be done. The sampling distribution would just be $B(N, \rho)$ and the expected value would be $N\rho$ (correct me if I am wrong). But what about this case where $S$ has some unknown distribution? How do we model this sampling distribution?

Best Answer

Let $Y$ be the outcome of one trial. The distribution of $Y$ is $(1-\rho) \cdot \delta_0 + \rho \cdot p_s$, where $\delta_0$ is a point mass at $0$. Then after $N$ (independent, I'm assuming) trials the total outcome is $T_N := Y_1 + \dots + Y_N$, where $Y_1, \dots, Y_N$ are iid copies of $Y$.

I'm not sure exactly what you mean by "model this distribution", but in particular the expectation is easy to calculate: $\mathbb{E}(Y) = \rho \mathbb{E}(S)$, so $\mathbb{E}(T_N) = N \rho \mathbb{E}(S)$.

It's also possible to calculate the variance using the law of total variance. Let $I$ be the indicator of whether the trial samples from $p_s$ ($I = 1$, with probability $\rho$) or returns $0$ ($I = 0$). Then $\operatorname{var}(Y|I) = I \operatorname{var}(S)$ and $\mathbb{E}(Y|I) = I\, \mathbb{E}(S)$, so \begin{align*} \operatorname{var}(Y) &= \mathbb{E}(\operatorname{var}(Y|I)) + \operatorname{var}(\mathbb{E}(Y|I)) = \rho \operatorname{var}(S) + \operatorname{var}(I)\, \mathbb{E}(S)^2 \\ &= \rho \operatorname{var}(S) + \rho \mathbb{E}(S)^2 - \rho^2 \mathbb{E}(S)^2, \end{align*} and, since the trials are independent, $\operatorname{var}(T_N) = N\cdot \operatorname{var}(Y)$.
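As a sanity check on the mean and variance formulas above, here is a minimal simulation sketch. It assumes, purely for illustration, that $S \sim \operatorname{Uniform}(0,1)$ (so $\mathbb{E}(S) = 1/2$ and $\operatorname{var}(S) = 1/12$); the question leaves $p_s$ unknown, and the function names and parameter values are hypothetical.

```python
import random
import statistics

def simulate_total(n_trials, rho, draw_s, rng):
    """One realization of T_N: each trial yields a draw of S with probability rho, else 0."""
    return sum(draw_s(rng) if rng.random() < rho else 0.0 for _ in range(n_trials))

def theoretical_moments(n_trials, rho, mean_s, var_s):
    """E(T_N) = N*rho*E(S); var(T_N) = N*(rho*var(S) + rho*(1-rho)*E(S)^2)."""
    mean_t = n_trials * rho * mean_s
    var_t = n_trials * (rho * var_s + rho * (1 - rho) * mean_s ** 2)
    return mean_t, var_t

rng = random.Random(0)              # fixed seed so the run is reproducible
N, rho = 20, 0.3                    # illustrative values, not from the question
draw_s = lambda r: r.random()       # ASSUMPTION: S ~ Uniform(0, 1)

samples = [simulate_total(N, rho, draw_s, rng) for _ in range(50_000)]
emp_mean = statistics.fmean(samples)
emp_var = statistics.variance(samples)
th_mean, th_var = theoretical_moments(N, rho, mean_s=0.5, var_s=1 / 12)

print(f"mean: simulated {emp_mean:.3f}, formula {th_mean:.3f}")
print(f"var:  simulated {emp_var:.3f}, formula {th_var:.3f}")
```

Swapping `draw_s` for any other sampler on $[0,1]$ (with the matching `mean_s` and `var_s`) should give the same kind of agreement, up to Monte Carlo error.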

EDIT: Here is an alternative way to calculate the same quantities. As you have already pointed out, we can also express $T_N$ as $\sum_{i=1}^{M} S_i$ where $M \sim \operatorname{Bin}(N,\rho)$ and the $S_i$ are iid copies of $S$. Then $\mathbb{E}(T_N)$ and $\operatorname{var}(T_N)$ can be calculated using standard results for sums of a random number of iid random variables (see for example these lecture notes).
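For reference, the random-sum results in question can be written out as follows. This is a sketch that assumes $M$ is independent of the $S_i$ and uses $\mathbb{E}(M) = N\rho$ and $\operatorname{var}(M) = N\rho(1-\rho)$ for the binomial count:

\begin{align*} \mathbb{E}(T_N) &= \mathbb{E}(M)\, \mathbb{E}(S) = N \rho\, \mathbb{E}(S), \\ \operatorname{var}(T_N) &= \mathbb{E}(M) \operatorname{var}(S) + \operatorname{var}(M)\, \mathbb{E}(S)^2 = N \rho \operatorname{var}(S) + N \rho (1-\rho)\, \mathbb{E}(S)^2. \end{align*}

Dividing the variance by $N$ gives $\rho \operatorname{var}(S) + \rho \mathbb{E}(S)^2 - \rho^2 \mathbb{E}(S)^2$, which agrees with the per-trial variance obtained from the law of total variance.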

Related Solutions

[Math] What distribution models number of trials needed for given number of successes and success rate

I figured this one out. :)

I can model it using a Negative Binomial: https://en.wikipedia.org/wiki/Negative_binomial_distribution

First, let us change the values of my case scenario, just to make it clearer. "Case scenario: a retrovirus infects a healthy cell. The virus programs the cell to brew little viruses, at a rate of 0.2 per second, until finally the cell bursts when the number of viruses inside it reaches 5. How to model this?"

We can model the number of failures $Y$ as $Y\sim\mathcal{NB}(5,0.2)$. That answers the question of how many failed trials occur before we reach 5 successes, each with success probability 0.2. But we do not want failed trials, we want total trials, and total trials = failed trials + successful trials. We know the number of successful trials, which is 5, so our random variable $X$ satisfies $X\sim5+\mathcal{NB}(5,0.2)$.

In fact, comparing the random generator function I proposed in the question with the negative binomial random generator (with this adjustment):

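```r
par(mfrow=c(1,2))  # two histograms side by side
# rmy() is the generator proposed in the question (not reproduced here)
hist(sapply(1:1e5, function(x) rmy(5, 0.2)))
# rnbinom() counts failures, so add the 5 successes to get total trials
hist(5+rnbinom(1e5, 5, 0.2))
```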

[Figure: testing distribution]

The outputs of mean, sd and summary are consistent between the two samples as well.
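The same check can be sketched without the question's generator by simulating the trials directly. The snippet below is an illustration in Python rather than the R used above. The function name is hypothetical, and the closed-form moments it compares against are the standard negative binomial ones for the failure count, $r(1-p)/p$ for the mean and $r(1-p)/p^2$ for the variance, shifted by the $r$ successes.

```python
import random
import statistics

def trials_until_successes(r, p, rng):
    """Count Bernoulli(p) trials performed until r successes have been observed."""
    trials = 0
    successes = 0
    while successes < r:
        trials += 1
        if rng.random() < p:
            successes += 1
    return trials

rng = random.Random(1)   # fixed seed so the run is reproducible
r, p = 5, 0.2            # values from the case scenario above

totals = [trials_until_successes(r, p, rng) for _ in range(50_000)]
sim_mean = statistics.fmean(totals)
sim_var = statistics.variance(totals)

# Total trials X = r + Y with Y ~ NB(r, p) counting failures.
nb_mean = r + r * (1 - p) / p    # shift by r changes the mean...
nb_var = r * (1 - p) / p ** 2    # ...but not the variance

print(f"mean: simulated {sim_mean:.2f}, negative binomial {nb_mean:.2f}")
print(f"var:  simulated {sim_var:.2f}, negative binomial {nb_var:.2f}")
```

Every simulated total is at least $r = 5$, which is the shift that distinguishes $X$ from the failure count $Y$.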

Find the binomial probability mass function of a binomial random variable

This is a hierarchical model:

$$\begin{gathered} X \sim \operatorname{Binomial}(n,p), \\ Y \mid X, d \sim \operatorname{Binomial}(N, r_d), \\ r_d = \Pr[X \ge d] = \sum_{x=d}^n \binom{n}{x} p^x (1-p)^{n-x}. \end{gathered}$$

Then the unconditional or marginal distribution of $Y$ is

$$\Pr[Y = y \mid N, n, p, d] = \binom{N}{y} \left(\sum_{x=d}^n \binom{n}{x} p^x (1-p)^{n-x} \right)^y \left( \sum_{x=0}^{d-1} \binom{n}{x} p^x (1-p)^{n-x} \right)^{N-y}.$$ We can also write this as $$\Pr[Y = y \mid N, n, p, d] = \binom{N}{y} (1-p)^{Nn} \left(\sum_{x=d}^n \binom{n}{x} \left(\frac{p}{1-p}\right)^x \right)^y \left( \sum_{x=0}^{d-1} \binom{n}{x} \left(\frac{p}{1-p}\right)^x \right)^{N-y}.$$

However, not much else can be done to simplify this expression any further. It turns out that the unconditional distribution is still binomial, because the probability $r_d$ does not depend on a realization of $X$; rather, it is a function only of the model parameters $n, p, d$.

For example, say we take $N = 50$ batches, each of which has size $n = 7$ items, and the probability of observing a defective item is $p = 0.01$. If we require the observation of at least $d = 1$ defect in a batch to reject it, then the random number of rejected batches is binomial with parameters $N = 50$ and $$r_d = \sum_{x=1}^7 \binom{7}{x} (0.01)^x (0.99)^{7-x} = 1 - \binom{7}{0} (0.99)^7 = 0.0679347.$$ The probability of rejecting at least $2$ batches out of the $50$ would be $$\Pr[Y \ge 2] = 1 - \Pr[Y \le 1] = 1 - \binom{50}{0} r_d^0 (1-r_d)^{50} - \binom{50}{1} r_d^1 (1-r_d)^{49} \approx 0.862203.$$
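The arithmetic in this example can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the helper names `binom_pmf` and `rejection_prob` are made up for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial(n, p) probability of exactly k successes."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def rejection_prob(n, p, d):
    """r_d = Pr[X >= d] for X ~ Binomial(n, p): a batch has at least d defects."""
    return sum(binom_pmf(x, n, p) for x in range(d, n + 1))

# Values from the worked example: 50 batches of 7 items, defect rate 0.01,
# and a batch is rejected when at least 1 defect is seen.
N, n, p, d = 50, 7, 0.01, 1

r_d = rejection_prob(n, p, d)

# Y ~ Binomial(N, r_d), so Pr[Y >= 2] = 1 - Pr[Y = 0] - Pr[Y = 1].
p_at_least_2 = 1 - binom_pmf(0, N, r_d) - binom_pmf(1, N, r_d)

print(f"r_d = {r_d:.7f}")
print(f"Pr[Y >= 2] = {p_at_least_2:.6f}")
```

Changing `d` to a larger threshold only changes `r_d`; the outer calculation stays binomial in $N$ and $r_d$, as the answer argues.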
