Inclusion probabilities for a survey with unequal probability of selection

Tags: estimation, probability, sampling, survey

I am attempting to run a simulation in R that requires me to take a survey where the probability of selection for each individual is not equal.

I have a factor covariate $x$ with levels $x_1, x_2, \ldots, x_5$ that is known for every member of the population before sampling. I wish to give individuals a probability of selection based on their $x$, so that individuals with $x=x_1$ have the lowest chance of selection and individuals with $x=x_5$ have the highest. So let's suppose $x_1=1, x_2=2, \ldots, x_5=5$.

Suppose I therefore let $\pi_i = \frac{x_i}{\sum_{j=1}^N x_j}$, where $\pi_i$ is the chance of selection for person $i$.

With this in mind, I need to derive the probability of being included in a sample of size $n$ from the population of size $N$. In simple random sampling, this is just $\frac{n}{N}$, but I'm unsure how to derive it in this instance.

Likewise, for the purpose of variance estimation, I need to derive the joint inclusion probability for two individuals $i$ and $j$, the probability both are included in the sample. Again, in simple random sampling this is just $\frac{n(n-1)}{N(N-1)}$, but I'm not sure what it is here.
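
As a sanity check on the two simple random sampling formulas quoted above, here is a minimal Python sketch that enumerates every equally likely sample from a tiny population and counts how often units appear. The function name `srs_inclusion` and the toy sizes $N=6$, $n=3$ are assumptions made purely for illustration; they are not part of the original question.

```python
from fractions import Fraction
from itertools import combinations


def srs_inclusion(N, n):
    """Enumerate all equally likely samples of size n drawn without
    replacement from units 0..N-1, and return the first-order inclusion
    probability of unit 0 and the joint inclusion probability of units 0 and 1."""
    samples = list(combinations(range(N), n))
    total = len(samples)
    first = Fraction(sum(1 for s in samples if 0 in s), total)
    joint = Fraction(sum(1 for s in samples if 0 in s and 1 in s), total)
    return first, joint


N, n = 6, 3  # hypothetical toy sizes, small enough to enumerate
first, joint = srs_inclusion(N, n)

# Compare the enumerated values with the closed-form expressions
print(first, Fraction(n, N))
print(joint, Fraction(n * (n - 1), N * (N - 1)))
```

Exact fractions are used so the enumerated and closed-form values can be compared without floating-point noise.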

The reason I need the inclusion probabilities is so that I can find the sample weights, which are needed for estimation of my total $Y$ and its variance.

Best Answer

Converting my comment to an answer.

You don't have enough information to answer the question specifically. Rather, it depends on the distribution of $x$ in the sampling frame (which we may assume to be the whole population).

As a simple illustration, suppose you have a target sample of, say, $n=300$. That sample will be apportioned 5:4:3:2:1 (15 parts) to members with $x=5$, $x=4$, $\ldots$, and $x=1$ respectively, giving 100, 80, 60, 40, and 20. Once you know the distribution of $x$ in the population, you just calculate the SRS sampling probability five times, once per level. For instance, if $x=1$ has 1,000 people in the population, then the sampling probability for that level is $20/1000 = 0.02$, and so on for the other levels.
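
The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `stratum_inclusion` is a hypothetical name, and every population count except the 1,000 people with $x=1$ is invented, since the answer does not give the others. The sampling weight is taken to be the reciprocal of the inclusion probability.

```python
from fractions import Fraction


def stratum_inclusion(n, ratio, pop_counts):
    """Split a total sample of size n across the levels of x in the given
    ratio, then return, per level, the sample size n_h, the SRS inclusion
    probability n_h / N_h, and the weight 1 / (n_h / N_h)."""
    parts = sum(ratio.values())
    design = {}
    for level, share in ratio.items():
        n_h = Fraction(n * share, parts)
        pi_h = n_h / pop_counts[level]
        design[level] = {"n_h": n_h, "pi": pi_h, "weight": 1 / pi_h}
    return design


# Allocation ratio 1:2:3:4:5 for x = 1, ..., 5, as in the answer
ratio = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}

# ASSUMPTION: only the count for x = 1 (1,000) comes from the answer;
# the other four counts are made up for illustration.
pop_counts = {1: 1000, 2: 1000, 3: 1000, 4: 1000, 5: 1000}

design = stratum_inclusion(300, ratio, pop_counts)
for level, d in design.items():
    print(level, int(d["n_h"]), float(d["pi"]), float(d["weight"]))
```

With real data, `pop_counts` would be replaced by the observed frequency table of $x$ in the sampling frame, and the allocation would need rounding whenever $n$ does not split into whole numbers.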

That is the answer to the problem along with a description of the missing information.
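
For the joint inclusion probabilities the question also asks about, the same stratified reading gives a closed form. It rests on an assumption not stated in the answer: each level of $x$ is sampled by simple random sampling without replacement, independently of the other levels. The symbols $n_h$ and $N_h$ (sample and population counts for level $h$) are introduced here for illustration only.

$$
\pi_{ij} =
\begin{cases}
\dfrac{n_h (n_h - 1)}{N_h (N_h - 1)} & \text{if } i \text{ and } j \text{ are both in level } h, \\[2ex]
\dfrac{n_h}{N_h} \cdot \dfrac{n_k}{N_k} & \text{if } i \text{ is in level } h \text{ and } j \text{ is in level } k \neq h.
\end{cases}
$$

For two people who both have $x=1$ in the example, with $n_h = 20$ and $N_h = 1000$, this gives $\frac{20 \cdot 19}{1000 \cdot 999} \approx 0.00038$.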