Sampling – Understanding Notation in Horvitz-Thompson Estimator

Tags: estimators, random-variable, sampling, terminology, unbiased-estimator

I am a bit confused about the terminology used in the context of sampling from populations. The Horvitz-Thompson estimator and the Hansen-Hurwitz estimator, for example, were introduced to deal with various sampling methods.

However, I don't understand why they are considered estimators in the strictest sense of the word. An estimator is a function that maps the sample space to a set of estimates, so to speak. This means that if the following function (the Horvitz-Thompson estimator) is an estimator,

$\hat{Y}_{HT}=\sum_{i=1}^{n}\pi_{i}^{-1}Y_{i}$

then $(Y_{1},\dots,Y_{n})$ is a random sample, and therefore $Y_{i}$ is apparently a random variable. Nevertheless, in this context the $Y_{i}$ are not treated as random variables (as they are, for example, in books on statistical inference). Precisely because they are not random variables, one has to introduce some kind of supporting random variable when proving the unbiasedness of the Horvitz-Thompson estimator. Note, for instance, the difference on https://en.wikipedia.org/wiki/Horvitz%E2%80%93Thompson_estimator between the definition of the Horvitz-Thompson estimator when it is introduced formally and the definition used when proving the unbiasedness of the estimator for the mean.

How is this tension resolved?

Best Answer

You need to handle the population explicitly to get clearer notation; then we can work backwards.

Suppose we have a fixed population of size $N$, with $Y_1,\dots,Y_N$ being the (non-random) values of $Y$. The randomness comes when we select a sample of size $n$. Let $R_i$ be the indicator that observation $i$ was sampled, and write $\pi_i=E[R_i]$ for the probability that $i$ was sampled.
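To see the key fact $E[R_i]=\pi_i$ in action, here is a minimal simulation sketch in Python/NumPy. The $\pi_i$ values are made up, and Poisson sampling (each unit included independently with its own probability) is assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up inclusion probabilities pi_i for a toy population of N = 5.
# Poisson sampling is assumed: each unit is included independently
# with its own probability, so E[R_i] = pi_i by construction.
pi = np.array([0.2, 0.5, 0.3, 0.8, 0.4])

# Simulate the indicators R_i many times; each row is one sample.
R = rng.random((100_000, pi.size)) < pi

print(R.mean(axis=0))  # empirical E[R_i] -- close to pi
```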

The population total is $T_Y=\sum_{i=1}^N Y_i$ and the Horvitz-Thompson estimator is $$\hat T_Y= \sum_{i=1}^N \frac{R_i}{\pi_i}Y_i.$$

Since the only randomness is in $R$, $$E[\hat T_Y] =E\left[\sum_{i=1}^N \frac{R_i}{\pi_i}Y_i\right]=\sum_{i=1}^N E\left[\frac{R_i}{\pi_i}\right]Y_i.$$

Now $E[R_i/\pi_i]=E[R_i]/\pi_i=\pi_i/\pi_i=1$ by the definition of $\pi_i$, and you are done.
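A quick Monte Carlo check makes the argument concrete. This is a sketch under the same assumptions as above (made-up $Y$ and $\pi_i$ values, Poisson sampling): the $Y_i$ stay fixed across replications, only the indicators are redrawn, and the average of the Horvitz-Thompson estimates settles near the true total.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed population: the Y_i are plain numbers, not random variables.
Y = np.array([3.0, 7.0, 1.0, 9.0, 5.0])   # made-up values
pi = np.array([0.2, 0.5, 0.3, 0.8, 0.4])  # made-up inclusion probabilities
T_Y = Y.sum()                             # true total, 25.0

# Draw many samples; only the indicators R are random, Y never changes.
n_reps = 100_000
estimates = np.empty(n_reps)
for rep in range(n_reps):
    R = rng.random(Y.size) < pi           # indicators with E[R_i] = pi_i
    estimates[rep] = np.sum(R / pi * Y)   # Horvitz-Thompson estimate

print(T_Y, estimates.mean())  # the Monte Carlo average is close to T_Y
```

Individual estimates vary a lot; unbiasedness is a statement about the average over repeated sampling, which is exactly what the simulation averages over.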

The problem with your original notation is that it hides the randomness. If you write $$\hat T= \sum_{i=1}^n \pi_i^{-1}Y_i$$ there is now randomness in both $\pi_i$ and $Y_i$: $\pi_1$ now means the sampling probability of the first observation we sampled, and $Y_1$ means the value of the first observation we sampled.

When I don't want to explicitly introduce $N$ and $R$ I will usually write $$\hat T= \sum_{i\in\text{sample}} \pi_i^{-1}Y_i$$ which at least indicates that the $i$ depend on the sample. Or you could write $$\hat T= \sum_{i=1}^n \pi_{I_i}^{-1}Y_{I_i}$$ where $I_i$ means the population index of the $i$th value that was sampled.
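For what it's worth, the population-indexed form with indicators and the sample-indexed form give the same number for any realized sample, which a few lines of the same sketch (made-up $\pi_i$, Poisson sampling) make explicit:

```python
import numpy as np

rng = np.random.default_rng(1)

Y = np.array([3.0, 7.0, 1.0, 9.0, 5.0])   # fixed population values (made up)
pi = np.array([0.2, 0.5, 0.3, 0.8, 0.4])  # made-up inclusion probabilities

R = rng.random(Y.size) < pi               # one realized sample
I = np.flatnonzero(R)                     # population indices of sampled units

pop_form = np.sum(R / pi * Y)             # sum over i = 1..N with indicators
sample_form = np.sum(Y[I] / pi[I])        # sum over i in the sample
print(pop_form, sample_form)              # the two forms agree exactly
```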

But it's usually easier just to be explicit about the population if you want to do any inferential reasoning. Save the use of indices over the sample for computational formulas (if anything).