Solved – Inclusion probabilities for a survey with unequal probability of selection

Tags: estimation, probability, sampling, survey

I am attempting to run a simulation in R that requires me to take a survey where the probability of selection for each individual is not equal.

I have a factor covariate $x$ with levels $x_1, x_2, …, x_5$ that is known for every member of the population before sampling. I wish to give individuals a probability of selection based on their $x$, so that individuals with $x=x_1$ have the lowest chance of selection and individuals with $x=x_5$ have the highest. So let's suppose $x_1 =1, x_2=2,…, x_5=5$.

Suppose I therefore let $\pi_i = \frac{x_i}{\sum_{j=1}^N x_j}$, where $x_i$ now denotes the value of $x$ for person $i$ and $\pi_i$ is the chance of selection for person $i$.

With this in mind, I need to derive the probability of being included in a sample of size $n$ from the population of size $N$. In simple random sampling, this is just $\frac{n}{N}$, but I'm unsure how to derive it in this instance.

Likewise, for the purpose of variance estimation, I need to derive the joint inclusion probability for two individuals $i$ and $j$, the probability both are included in the sample. Again, in simple random sampling this is just $\frac{n(n-1)}{N(N-1)}$, but I'm not sure what it is here.

The reason I need the inclusion probabilities is so that I can find the sample weights, which are needed for estimation of my total $Y$ and its variance.

Best Answer

Converting my comment to an answer.

You don't have enough information to answer the question specifically. Rather, it depends on the distribution of $x$ in the sampling frame (which we may assume to be the whole population).

Trivially, suppose you have a target sample of, say, $n=300$. That sample will be apportioned 5:4:3:2:1 (15 parts) to members with $x=5$, $x=4$, $\ldots$, and $x=1$ respectively (so 100, 80, 60, 40, 20). Once you know the distribution of $x$ in the population, you just calculate the SRS sampling probability five times, once per level. For instance, if $x=1$ has 1,000 people in the population, then their sampling probability is $20/1000 = 0.02$, and so on.
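As a rough sketch of that calculation in R: only the count of 1,000 people at $x=1$ comes from the example above; the other four stratum sizes in `N_h` are made-up numbers, and the variable names are just for illustration.

```r
# Hypothetical population counts for x = 1, ..., 5.
# Only N_h[1] = 1000 is taken from the text; the rest are assumptions.
x   <- 1:5
N_h <- c(1000, 1500, 1200, 900, 800)
n   <- 300

# Apportion the sample in proportion to x (1:2:3:4:5, i.e. 15 parts).
n_h <- n * x / sum(x)

# Within each level the draw is SRS, so the inclusion probability is n_h / N_h.
pi_h <- n_h / N_h

# Sampling weights are the inverse inclusion probabilities.
w_h <- 1 / pi_h

print(data.frame(x, N_h, n_h, pi_h, w_h))
```

Note that the resulting probabilities depend on the stratum sizes as much as on $x$, which is why the distribution of $x$ in the population is the missing piece.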

That is the answer to the problem along with a description of the missing information.
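For the joint inclusion probabilities, the standard stratified-SRS results apply, assuming each level $h$ of $x$ is sampled independently by SRS without replacement with a fixed $n_h$ out of $N_h$ (the symbols $n_h$, $N_h$ and $\pi_{ij}$ are notation introduced here, not in the question). For $i$ in level $h$:

$$ \pi_i = \frac{n_h}{N_h} $$

For $i \neq j$ in the same level $h$:

$$ \pi_{ij} = \frac{n_h(n_h-1)}{N_h(N_h-1)} $$

For $i$ in level $h$ and $j$ in a different level $h'$:

$$ \pi_{ij} = \frac{n_h}{N_h}\cdot\frac{n_{h'}}{N_{h'}} $$

The weights $1/\pi_i$ then give the usual Horvitz-Thompson estimator of the total, where $s$ is the sample:

$$ \hat{Y} = \sum_{i \in s} \frac{y_i}{\pi_i} $$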

Related Solutions

Sampling – How to Apply Brewer’s Method for Sampling with Unequal Probabilities When n>2

My reading of Brewer's procedure is as follows.

To sample the first unit, set $r=n$ and compute $$ D_1 = \sum_{i=1}^N\frac{P_i(1-P_i)}{(1-nP_i)} $$ Then sample the first unit with probability $$P^{(1)}_i = \frac{P_i (1 - P_i)}{D_1 (1-nP_i)}$$ Let the index of the first sampled unit be $I_1$.
To sample the second unit, set $r=n-1$ and compute $$ D_2 = \sum_{i\in\{1,\ldots,N\}, i\notin \{ I_1 \} }\frac{P_i(1-P_i)}{(1-(n-1)P_i)} $$ Then sample the second unit with probability $$P^{(2)}_i = \frac{P_i (1 - P_i)}{D_2 (1-(n-1)P_i)}$$ Let the index of the second sampled unit be $I_2$.
etc.

To sample the $k$-th unit, set $r=n-k+1$ and compute $$ D_k = \sum_{i\in\{1,\ldots,N\}, i\notin \{ I_1, \ldots, I_{k-1} \} }\frac{P_i(1-P_i)}{(1-(n-k+1)P_i)} $$ Then sample the $k$-th unit with probability $$P^{(k)}_i = \frac{P_i (1 - P_i)}{D_k (1-(n-k+1)P_i)}$$
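The steps above can be sketched in R roughly as follows. This follows the reading given here rather than Brewer's original paper; `brewer_draws` is a hypothetical helper name, the vector `P` is made up, and the sketch assumes the $P_i$ sum to 1 and that $nP_i < 1$ for every unit so that all denominators stay positive.

```r
# Draw-by-draw sketch of the procedure as described above.
# `brewer_draws` is a hypothetical name, not a function from any package.
brewer_draws <- function(P, n) {
  # Assumptions: P sums to 1, all P > 0, and n * P_i < 1 for every unit.
  stopifnot(abs(sum(P) - 1) < 1e-8, all(P > 0), n * max(P) < 1)
  N <- length(P)
  selected <- integer(0)
  for (k in seq_len(n)) {
    r <- n - k + 1                              # units still to be drawn
    remaining <- setdiff(seq_len(N), selected)  # exclude I_1, ..., I_{k-1}
    w <- P[remaining] * (1 - P[remaining]) / (1 - r * P[remaining])
    probs <- w / sum(w)                         # sum(w) plays the role of D_k
    # Draw a position among the remaining units, then map back to the unit index.
    pick <- remaining[sample.int(length(remaining), 1, prob = probs)]
    selected <- c(selected, pick)
  }
  selected
}

set.seed(1)
P <- c(0.05, 0.10, 0.15, 0.20, 0.10, 0.15, 0.25)  # made-up values, sum to 1
s <- brewer_draws(P, n = 3)
print(s)
```

Drawing a position with `sample.int()` and indexing into `remaining` avoids the pitfall where `sample()` on a length-one vector samples from `1:x` instead.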

B&H 83 refers to Brewer (1975) in Australian J of Statistics which I don't see any way of getting.

Solved – Difference between calculated inclusion probability and what is returned by sampling function

Sampling with replacement is boring. Sampling without replacement (WOR) is very interesting. That's why the authors of library(sampling) restricted their attention to sampling WOR. So inclusionprobabilities() takes the baseline rates in your y and figures out what the inclusion probabilities would be should a proper unequal probability WOR sampling algorithm be applied to these numbers.

Looking at the source code, I imagine that your snippet of code reproduces the "regular" case of inclusionprobabilities() when none of the inclusion probabilities exceed 1. In that regular case, the inclusion probabilities are simply the input probabilities scaled up so that their sum is equal to the target sample size. Note that inclusion probabilities refer to the units on the frame, rather than the specific samples, as your code does.
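A minimal base R sketch of that regular case, using made-up values for `y` and `n` (not the ones from the question). It only covers the situation where no scaled value exceeds 1, and it does not call the package itself.

```r
# Made-up baseline rates and target sample size, for illustration only.
y <- c(1, 3, 4, 2)
n <- 2

# Regular case: scale the inputs so that they sum to the sample size.
pik <- n * y / sum(y)

print(pik)       # one inclusion probability per unit on the frame
print(sum(pik))  # equals the target sample size n
```

If any scaled value came out above 1, this simple rescaling would no longer give valid probabilities, which is the case the sketch leaves out.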

For sampling with replacement, I believe your calculations are correct, in that probability of each pair is the product of probabilities. Then what inclusionprobabilities refers to are the sums across all rows where either X1 or X2 are equal to 1, 2, 3 or 4 (the indices of the original units):

for(k in 1:4) {
  print(sum(df$p[df$X1==k|df$X2==k]))
}

This is to say, unit 1 appears in 1.8% of the samples, while unit 3, in 77.3% of the samples. However, these numbers sum up neither to 1 (as base probabilities should) nor to 2 (as correct inclusion probabilities should), and so they are kinda weird, in the end.
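To see why those sums land between 1 and 2, here is a self-contained sketch with made-up base probabilities `p` (not the values behind the percentages above). It assumes `df` holds every ordered pair of two draws with replacement, which is a guess at the structure of the data frame in the question.

```r
# Made-up base probabilities for 4 units; they sum to 1.
p <- c(0.1, 0.2, 0.3, 0.4)

# All ordered pairs of two draws with replacement, with their probabilities.
df <- expand.grid(X1 = 1:4, X2 = 1:4)
df$p <- p[df$X1] * p[df$X2]

# Probability that unit k appears in at least one of the two draws.
appear <- sapply(1:4, function(k) sum(df$p[df$X1 == k | df$X2 == k]))

print(appear)
print(sum(appear))  # equals 2 - sum(p^2), so strictly between 1 and 2
```

Each entry equals $1-(1-p_k)^2$, so the total is $2 - \sum_k p_k^2$: a unit drawn twice is only counted once, which is what keeps the sum below 2.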
