Sampling – Covariance for Sampling Without Replacement in Survey Sampling Studies

covariance, sampling, self-study, survey-sampling

Suppose I have the numbers $1, 2, \ldots, 10$ and I sample 5 of them randomly without replacement, denoting the draws by $X_1, X_2, X_3, X_4, X_5$. What is $Cov(X_i,X_j)$ for $i \ne j$?

So $Cov(X_i,X_j)=E(X_iX_j)-E(X_i)E(X_j)$

Each $X_i$, considered on its own, is uniform on $\{1, 2, \ldots, 10\}$, so $E(X_i)=E(X_j)=11/2$.

For $E(X_iX_j)$ I am a bit stuck.

I first considered the joint probability $f(x_i,x_j)=f(x_i|x_j)f(x_j)=\frac{1}{(n-1)n}=1/90$, but this seems wrong.

Best Answer

Problems in sampling from finite populations without replacement can usually be solved in terms of the sample inclusion probabilities $\pi(x)$, $\pi(x,y)$, etc.


Let $\pi(x) = \Pr(X_1 = x)$ for any $x$ in the population $\mathcal P$ (with $n=10$ elements) and let $\pi(x,y)=\Pr((X_1,X_2)=(x,y))$ for any $x$ and $y$ in $\mathcal P$. By definition of expectation,

$$E(X_1) = \sum_{x\in\mathcal P} \pi(x)x\tag{1}$$

and

$$E(X_1X_2) = \sum_{(x,y)\in\mathcal{P}^2} \pi(x,y)x y \tag{2}.$$

For this sampling procedure $X_1$ has equal chances of being any of the $n$ elements of $\mathcal P $, whence $$\pi(x)=\frac{1}{n}\tag{3}$$ for all $x$. Because sampling is without replacement, only the pairs $(x,y)$ with $x\ne y$ are possible, but all $n(n-1)$ of those are equally likely. Therefore

$$\pi(x,y) = \begin{cases}\dfrac{1}{n(n-1)} & x\ne y \\ 0 & x=y\end{cases}\tag{4}$$


That's the general result. For any particular population, you just have to do the arithmetic implied by formulae $(1)$ through $(4)$.
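As a rough sketch of that arithmetic, the R snippet below evaluates formulae $(1)$ through $(4)$ for an arbitrary numeric population. The function name `pair_cov` and the second example population are illustrative assumptions, not part of the original answer. It relies on the identity $\sum_{x\ne y} xy = \left(\sum x\right)^2 - \sum x^2$ and treats every element of the vector as a distinct population unit.

```r
# Hypothetical helper (not from the original answer): covariance of two
# distinct draws when sampling without replacement from `pop`, with every
# element of `pop` treated as a distinct, equally likely unit.
pair_cov <- function(pop) {
  n <- length(pop)
  stopifnot(n >= 2)
  e1  <- sum(pop) / n                               # formulae (1) and (3)
  e12 <- (sum(pop)^2 - sum(pop^2)) / (n * (n - 1))  # formulae (2) and (4)
  e12 - e1^2                                        # E(X1 X2) - E(X1) E(X2)
}

pair_cov(1:10)               # the population in the question
pair_cov(c(2, 3, 5, 7, 11))  # an arbitrary illustrative population
```

For the population $\{1, 2, \ldots, 10\}$ the result should agree with the value $-11/12$ obtained in closed form for this population.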

Suppose now that $\mathcal{P} = \{1,2,\ldots, n\}$. Formulae $(1)$ and $(3)$ give

$$E(X_1) = \sum_{i=1}^{n} \frac{1}{n} i = \frac{n+1}{2}$$

while formulae $(2)$ and $(4)$ give

$$\eqalign{E(X_1X_2) &= \sum_{i,j=1;\, i\ne j}^{n} \frac{1}{n(n-1)} i j \\ &= \frac{1}{n(n-1)}\left(\sum_{i=1}^{n}\sum_{j=1}^{n} i j - \sum_{i=1}^{n} i^2\right)\\ &= \frac{1}{n(n-1)}\left(\sum_{i=1}^{n}i\ \sum_{j=1}^{n} j - \sum_{i=1}^{n} i^2\right)\\ &= \frac{1}{n(n-1)}\left(\left(\frac{n(n+1)}{2}\right)^2 - \frac{n(1+n)(1+2n)}{6}\right) \\ &= \frac{3n^2 + 5n + 2}{12}. }$$
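The final simplification in this display is compressed. One way to fill it in (a sketch of the algebra, not necessarily the route the answer took) is to factor $n(n+1)$ out of the bracket:

$$\eqalign{\left(\frac{n(n+1)}{2}\right)^2 - \frac{n(n+1)(2n+1)}{6} &= \frac{n(n+1)}{12}\bigl(3n(n+1) - 2(2n+1)\bigr) \\ &= \frac{n(n+1)\left(3n^2 - n - 2\right)}{12} \\ &= \frac{n(n+1)(n-1)(3n+2)}{12}. }$$

Dividing by $n(n-1)$ leaves $\frac{(n+1)(3n+2)}{12} = \frac{3n^2+5n+2}{12}$.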

Because there is no distinction among any of the $X_i$, these results hold for any $i \ne j$, not just $i=1$ and $j=2$. In particular,

$$\operatorname{Cov}(X_i,X_j) = E(X_iX_j) - E(X_i)E(X_j) = \frac{3n^2 + 5n + 2}{12} - \left(\frac{n+1}{2}\right)^2 = -\frac{n+1}{12}.$$
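The closed form can also be checked exactly, without simulation, by enumerating every ordered pair of distinct values. The sketch below does this in R for a range of population sizes; the function name `exact_cov` and the range `2:12` are arbitrary choices made for illustration.

```r
# Exact covariance of (X_1, X_2) by enumerating all ordered pairs of
# distinct values from 1..n; each pair has probability 1 / (n (n - 1)),
# so plain means over the enumerated pairs are exact expectations.
exact_cov <- function(n) {
  pairs <- expand.grid(x = seq_len(n), y = seq_len(n))
  pairs <- pairs[pairs$x != pairs$y, ]   # without replacement: x != y
  mean(pairs$x * pairs$y) - mean(pairs$x) * mean(pairs$y)
}

n_values <- 2:12
cbind(n           = n_values,
      enumerated  = sapply(n_values, exact_cov),
      closed_form = -(n_values + 1) / 12)
```

The `enumerated` and `closed_form` columns should match up to floating-point rounding.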


When $n=10$, the covariance of $X_i$ and $X_j$ is $-11/12 \approx -0.917$. As a check, here is a simulation of a million such samples (using R):

    # Each row of the transposed matrix is one sample of 5 draws from 1:10
    cov(t(replicate(1e6, sample.int(10, 5))))

The output is the $5\times 5$ covariance matrix of $(X_1, \ldots, X_5)$. Because this is a simulation the output is random; but because it's a largish simulation, it's reasonably stable from one run to the next. In the first simulation I did, the off-diagonal elements of this covariance matrix ranged from $-0.9277$ to $-0.9080$ with a mean of $-0.9169$: narrowly spread around $-11/12$ as one would expect.
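To summarize the off-diagonal elements in a repeatable way, something along the following lines could be used. This is only a sketch: the seed is an arbitrary assumption, and the number of replicates is reduced to $10^5$ to keep the run short, so the spread will be somewhat wider than the figures quoted above.

```r
set.seed(1)                                   # arbitrary seed, for repeatability only
sims <- t(replicate(1e5, sample.int(10, 5)))  # one sample of 5 draws per row
S <- cov(sims)                                # 5 x 5 sample covariance matrix
off_diag <- S[row(S) != col(S)]               # the 20 off-diagonal entries

range(off_diag)  # spread of the estimated covariances
mean(off_diag)   # should be close to -11/12
```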
