Statistical Sampling – Proving $E[\hat{\tau}_D] = P(n_D > 0)\tau_D$ and $\vert E[\hat{\tau}_D] - \tau_D\vert \leq \tau_D(1-\frac{N_D}{N})^n$

conditional-expectation, discrete-distributions, sampling

Consider the following double sampling scheme:

We have a population of size $N$ with variable of interest $y_i$ for each $i \in \{1,\dots,N\}$, and (fixed) subpopulation $D$ of size $N_D$. Let $S$ denote a Simple Random Sample of size $n$ of the population.

For $S_D := S \cap D$ of (random) size $n_D$, we want to estimate the total $\tau_D=\sum_{i \in D} y_i$. We will use the following estimator:

$$
\hat{\tau}_D =1_{\{n_D > 0\}}\frac{N_D}{n_D} \sum_{i \in S_D} y_i .
$$

I want to show the following two properties for $\hat{\tau}_D$:

(i) $E[\hat{\tau}_D] = P(n_D > 0)\tau_D$

(ii) $\vert E[\hat{\tau}_D] - \tau_D\vert \leq \tau_D(1-\frac{N_D}{N})^n$

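A quick Monte Carlo sketch seems consistent with (i); all choices below (the sizes, the uniform $y_i$, taking $D$ to be the first $N_D$ units) are arbitrary and purely for illustration:

```python
# Monte Carlo sketch of property (i): E[tau_hat_D] = P(n_D > 0) * tau_D.
# All parameters are toy values chosen for illustration.
import random

N, N_D, n = 12, 4, 5                       # population, subpopulation, sample sizes
y = [random.uniform(0, 10) for _ in range(N)]
D = set(range(N_D))                        # take D to be the first N_D units
tau_D = sum(y[i] for i in D)

reps = 200_000
est_sum, hits = 0.0, 0
for _ in range(reps):
    S = random.sample(range(N), n)         # simple random sample without replacement
    S_D = [i for i in S if i in D]
    if S_D:                                # indicator 1{n_D > 0}
        est_sum += N_D / len(S_D) * sum(y[i] for i in S_D)
        hits += 1

print("mean of tau_hat_D:", est_sum / reps)
print("P(n_D>0) * tau_D :", hits / reps * tau_D)  # should roughly agree
```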

For (i) I believe that if we condition on $n_D = m$ with $m \geq 1$, then $S_D$ will be a Simple Random Sample of size $m$ from $D$, where we can use that
$$
E\Big[\frac{N_D}{n_D} \sum_{i \in S_D} y_i \,\Big|\, n_D = m\Big] = \tau_D
$$

i.e. the usual estimator of the population total is conditionally unbiased. Hence, using the tower property $E[X]=E[E[X \mid Y]]$,
$$
E[\hat{\tau}_D]=E[1_{\{n_D > 0\}}E[\frac{N_D}{n_D} \sum_{i \in S_D} y_i \mid n_D]]
$$

where we have pulled the random variable $1_{\{n_D > 0\}}$ out of the conditional expectation, which is allowed because it is a function of $n_D$. Now it is very tempting to split this expectation into two:
$$
E[1_{\{n_D > 0\}}E[\frac{N_D}{n_D} \sum_{i \in S_D} y_i \mid n_D]] \overset{!}{=} E[1_{\{n_D > 0\}}]E[\frac{N_D}{n_D} \sum_{i \in S_D} y_i \mid n_D] = P(n_D > 0)\tau_D
$$

but this would require those random variables to be independent, which doesn't seem right to me. Is it true, or do we use something else?
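Numerically, the inner conditional expectation does appear to be the constant $\tau_D$ on every event $\{n_D = m\}$ with $m \geq 1$; here is a sketch that groups simulated samples by the realized $n_D$ (same arbitrary toy setup as above):

```python
# Sketch: estimate E[(N_D/n_D) * sum_{i in S_D} y_i | n_D = m] for each m
# by grouping simulated samples on the realized value of n_D.
import random
from collections import defaultdict

N, N_D, n = 12, 4, 5
y = [random.uniform(0, 10) for _ in range(N)]
D = set(range(N_D))
tau_D = sum(y[i] for i in D)

sums = defaultdict(float)
counts = defaultdict(int)
for _ in range(200_000):
    S = random.sample(range(N), n)
    S_D = [i for i in S if i in D]
    m = len(S_D)
    if m > 0:
        sums[m] += N_D / m * sum(y[i] for i in S_D)
        counts[m] += 1

print("tau_D =", tau_D)
for m in sorted(counts):                   # each line should be close to tau_D
    print(f"E[estimator | n_D = {m}] ~ {sums[m] / counts[m]:.3f}")
```

If that is right, then $1_{\{n_D > 0\}}\,E[\cdot \mid n_D] = 1_{\{n_D > 0\}}\,\tau_D$ almost surely, so no independence would be needed, but I'd like confirmation.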

As for (ii) we have using (i)

$$
\vert E[\hat{\tau}_D] - \tau_D\vert = \vert 1-P(n_D > 0)\vert \, \vert \tau_D \vert = P(n_D = 0)\, \tau_D,
$$

where the last equality uses $\tau_D \geq 0$ (which the bound in (ii) implicitly assumes). But I am not sure how to proceed from here. Could we compute $P(n_D = 0)$? We don't know the distribution of $n_D$ or anything… Any help is appreciated!
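At least empirically, $P(n_D = 0)$ is easy to approximate (same arbitrary toy setup as above):

```python
# Estimating P(n_D = 0) by simulation: the fraction of samples missing D entirely.
import random

N, N_D, n = 12, 4, 5
D = set(range(N_D))
reps = 200_000
misses = sum(
    1 for _ in range(reps)
    if not D.intersection(random.sample(range(N), n))
)
print("P(n_D = 0) ~", misses / reps)
```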


Update:

For (ii) I believe the idea is to get the upper estimate by going from the usual Simple Random Sample to one with replacement, where the inclusion events are i.i.d. So we want an inequality like this:
$$
P(n_D = 0)=P(i_1,\ldots,i_n \in D^c) \overset{!}{\leq} \prod_{j=1}^{n} P(i_j \in D^c) = \left(1-\frac{N_D}{N}\right)^n.
$$

But how can we show the ! step?
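If I count directly, choosing all $n$ sampled units from $D^c$ gives $P(n_D=0) = {N-N_D \choose n}/{N \choose n}$, and a quick check with math.comb over a few arbitrary parameter triples suggests the "!" inequality does hold:

```python
# Numeric check of the "!" step: exact P(n_D = 0) for sampling without
# replacement versus the i.i.d. bound (1 - N_D/N)^n, for toy parameters.
from math import comb

for N, N_D, n in [(12, 4, 5), (100, 10, 20), (50, 25, 10)]:
    p_exact = comb(N - N_D, n) / comb(N, n)
    p_bound = (1 - N_D / N) ** n
    print((N, N_D, n), p_exact, "<=", p_bound, ":", p_exact <= p_bound)
```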

Best Answer

For (ii) I believe the idea is to get the upper estimate by going from the usual Simple Random Sample to one with replacement, where the inclusion events are i.i.d.

Is the sample $S$ drawn with or without replacement? You said that it's a simple random sample, which by the usual convention means sampling without replacement, and the sentence above (going from the SRS to a with-replacement scheme) seems to confirm that.

Could we compute $P(n_D=0)$? We don't know the distribution of $n_D$ or anything.

I'm going to assume that you're sampling without replacement. In that case $n_D$ follows a hypergeometric distribution, and as long as $n \leq N-N_D$ (otherwise $P(n_D = 0) = 0$ and the bound is trivial) we have:

$$ P(n_D=0) = \frac{{N_D \choose 0}{N-N_D \choose n} }{{N \choose n}}=\frac{(N-N_D)!\,(N-n)!}{(N-N_D-n)!\,N!} $$

Writing this as a telescoping product gives exactly the bound in (ii):

$$ P(n_D=0) = \prod_{j=0}^{n-1} \frac{N-N_D-j}{N-j} \leq \prod_{j=0}^{n-1} \frac{N-N_D}{N} = \left(1-\frac{N_D}{N}\right)^n, $$

where each factor satisfies $\frac{N-N_D-j}{N-j} \leq \frac{N-N_D}{N}$, because subtracting the same $j \geq 0$ from the numerator and denominator of a fraction less than $1$ can only decrease it.
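As a sanity check (with arbitrary toy parameters), the closed form, the telescoping product, and the bound all agree numerically:

```python
# Sanity check: the closed form for P(n_D = 0) matches the product of the
# step-by-step "miss D" probabilities, and both respect the part (ii) bound.
from math import comb, prod

N, N_D, n = 30, 8, 6
p_closed = comb(N - N_D, n) / comb(N, n)
p_product = prod((N - N_D - j) / (N - j) for j in range(n))
print(p_closed, p_product)               # these should coincide
print(p_closed <= (1 - N_D / N) ** n)    # and respect the bound: True
```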
