MATLAB: Sampling some elements without replacement out of big population, based on a probability distribution

sampling without replacement based on probability distribution

Hello,
Given population U=1:20, The associated probability series is related to the importance of each element: p=[0.01 0.05 0.11 0.08 0.02 0.02 0.01 0.055 0.1 0.2 0.05 0.05 0.01 0.03 0.025 0.06 0.04 0.05 0.01 0.02] Based on this probability distribution I want to sample 10 numbers from population U without replacement (one number should be selected at most once).
The best codes I could composed is this one, but it is a code that results in sampling with replacement U = 1:20; p=[0.01 0.05 0.11 0.08 0.02 0.02 0.01 0.055 0.1 0.2 0.05 0.05 0.01 0.03 0.025 0.06 0.04 0.05 0.01 0.02]; n = 10; c = cumsum([0,p(:).']); c = c/c(end); [~,i] = histc(rand(1,n),c); r = v(i); % map to v values
A next code I tried is U = 1:20; p=[0.01 0.05 0.11 0.08 0.02 0.02 0.01 0.055 0.1 0.2 0.05 0.05 0.01 0.03 0.025 0.06 0.04 0.05 0.01 0.02]; %to get the cumulative distribution in reversed order q=flip(p); r=cumsum(q); s=[0 r];
n = 10;%number of columns that should be selected X = zeros(n,1);%initiate list to save selected columns
%draw 10 uniform numbers, each one represents an arbitrary cumulative distribution x = rand(n,1);
%link the cdf value depending on its range to the right element of the populatione for j=1:size(U,2)-1 X(x>=s(j) & x<s(j+1))=20-j; end
But this code gives me problems at the last line and I do not know how to cover the situation when cum probabiliy equals one.
Can somebody give me some feedback/suggestion?
Thank you in advance!

Best Answer

Your question is mathematically inconsistent. Take a very simple example, a population of 2: [1,2], and you want to generate 1 with probability 3/4 and 2 with probability 1/4.

If you draw 2 samples without replacement you either get [1,2] or [2,1], so the prescribed probabilities of 3/4 and 1/4 never meet.

Now you could simulate the real-life drawing by a loop:

U = 1:20; 
p=[0.01 0.05 0.11 0.08 0.02 0.02 0.01 ...
   0.055 0.1 0.2 0.05 0.05 0.01 0.03 ...
   0.025 0.06 0.04 0.05 0.01 0.02]; 
n = 10; 
nset = 10000;
% without replacement
ik = zeros(nset,n);
for i=1:nset
    pk = p(:).';
    for k=1:n
        c = cumsum([0,pk]);
        c = c/c(end);
        [~,d] = histc(rand,c);
        ik(k) = d;
        pk(d) = 0;
    end
end
wor = U(ik);

This give one set of drawing n element without replacement. Repeat it as long as you like. But there is no warranty the appearance is the prescribed P.

For example the element #10 has prescribed probability of 0.2, however it can appears at mots once in when drawing a sequence of 10 without replacement, so the probability the element #10 appears can never goes above 1/10 = 0.1, which is not 0.2 you want it to be. (The histogram plot in the subsequent comment bellow also confirms this, look at bin #10)