Chose a random element from a list while traversing it without knowing the size of it

probabilityprogramming

I have a list $A$ (that I can only traverse from start to finish, no random access), and some of its elements (I don't know how many) respect the property $P$; I have to traverse $A$ and chose at random an element that respects $P$ (every element that respects $P$ should have the same chance). It is possible that no such element exists. As a constrain, I can't just create a separate list of $P$-respecting elements while I traverse A checking for $P$ (but I still can memorize a fixed number of those elements).

The simplest solution is probably to traverse $A$, find the number $n$ of $P$-respecting elements, then chose a random number $r$ in $[1,n]$ and traverse again $A$ until I find the $r$ th $P$-respecting element; but traversal is costly, so I would prefer to do it one time if possible.

The best approach I found is: when I find the first $P$-respecting element, the algorithm memorizes it in variable $c$. If I find another $P$-respecting element, the algorithm can randomly decide to overwrite c with it, with probability $Q$. This way, every $P$-respecting element has a chance to be chosen. The problem is that, if $Q$ is a constant, later occurring elements have a better chance to be chosen. Is there some way to progressively tune Q so that all the elements have an equal chance, without having to know how much $P$-respecting elements exists beforehand? Alternatively, does a different, better approach to my problem exists?

I'm not sure that this site was a better choice than stack-overflow for the question, but the "choosing the right $Q$" part seemed more suitable for here.

Best Answer

When you find the $kth$ element with property $P$, give it a $\frac{1}{k}$ probability of being selected (and replacing the previously selected element). This involves only keeping track of how many $P$ elements you have seen so far.

If you have $n>0$ $P$ elements, you will always (provisionally) choose the first one.

When you reach the second $P$ element, it has a $\frac12$ chance of replacing the first one. So at this point each of the first two elements have a $\frac12$ chance of being (provisionally) selected.

When you get to the third $P$ element, it gets a $\frac13$ chance of being selected. If it is not selected, you keep the previously selected element. But the first two elements have an equal chance (by symmetry) of being selected; and you've given $\frac13$ of the probability to the third element, and so the first two elements are splitting a $\frac23$ chance of being selected--hence $\frac13$ each.

You can continue this argument, perhaps making it formal by induction.

Related Question