Probability of overlap between independent samples of different sizes

probability

Today, I was trying to use probabilities to make an 'everyday' (and not very important) decision… but I found that I'd forgotten how to approach this sort of problem (if I am correct in believing that I ever knew how to deal with a problem like this).

Here is the form of the problem I was trying to solve:

There are Q objects in a container, each with its own unique color.
Agent1 takes a sample G of M unique objects out of the container, and then puts those M objects back into the container.

If Agent2 takes a sample V of L unique objects out of the container, then…

  1. What is the probability that the larger of the sets G and V will contain the smaller of these sets?

  2. What is the probability that the sets G and V will have an overlap of T elements?

To give you a rough idea of what I was trying to figure out, I was thinking:
"Hmm… There are 10,000 'things' to choose from, and I assume (for simplicity!) that each one has an equal chance of being chosen. I've picked 900 'things'. My friend picked 1500 things… and he picked 890 things in common with me. How likely is it that my friend peeked at my 'picking' history, and copied me?"

Best Answer

This is kind of similar to capture-recapture (or mark-and-recapture) sampling, but instead of assuming random selection and trying to estimate the population size (you know it already), you want to see if the recapture is consistent with resampling 'at random'. Your problem turns out to be easier.

So imagine you have an urn full of white balls. When you sample yours, you magically turn them black before putting them back. So now the urn has (otherwise identical) black and white balls.

Your friend draws his sample, and your interest is "does he get 'too many' black balls?". If he's sampling at random, the number of black balls in his sample follows a hypergeometric distribution.
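In the question's notation, that distribution also answers the second part directly: the chance of an overlap of exactly T is C(M, T) * C(Q - M, L - T) / C(Q, L). Here is a minimal Python sketch of that formula, assuming uniform sampling without replacement; the function name `overlap_pmf` and the small numbers in the example are made up for illustration and are not from the question.

```python
from fractions import Fraction
from math import comb

def overlap_pmf(Q: int, M: int, L: int, T: int) -> Fraction:
    """P(overlap == T) when L objects are drawn at random, without
    replacement, from Q objects of which M were picked earlier."""
    # Impossible overlaps get probability zero.
    if T < 0 or T > M or L - T < 0 or L - T > Q - M:
        return Fraction(0)
    # Choose T of the M marked objects and L - T of the Q - M unmarked ones.
    return Fraction(comb(M, T) * comb(Q - M, L - T), comb(Q, L))

# Illustrative numbers only: 10 objects, 4 marked, 3 drawn.
probs = [overlap_pmf(10, 4, 3, t) for t in range(4)]
print([str(p) for p in probs])
```

Exact fractions are used so that the probabilities over all possible overlaps sum to exactly 1, which makes a handy sanity check.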

You can assess the probability of getting as many as he got, or more (i.e. a sample at least as unusual as his) by finding the upper tail probability from the hypergeometric. In large samples like yours, you could use a normal approximation (I'd be inclined to use a continuity correction).
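As a rough sketch of the normal approximation with a continuity correction, the following Python snippet uses the standard hypergeometric mean L*M/Q and variance L*(M/Q)*(1 - M/Q)*(Q - L)/(Q - 1). The function name `upper_tail_normal` is my own, and this is only an approximation, so expect it to drift from the exact tail, especially far out in the tail.

```python
from math import erfc, sqrt

def upper_tail_normal(Q: int, M: int, L: int, t: int) -> float:
    """Approximate P(overlap > t) using a normal distribution with the
    hypergeometric mean and variance, plus a continuity correction."""
    p = M / Q
    mean = L * p
    # The last factor is the finite-population correction.
    var = L * p * (1 - p) * (Q - L) / (Q - 1)
    # "More than t" means "at least t + 1", so cut halfway, at t + 0.5.
    z = (t + 0.5 - mean) / sqrt(var)
    # Upper tail of the standard normal via the complementary error function.
    return 0.5 * erfc(z / sqrt(2))

# The numbers from the question: Q = 10000, M = 900, L = 1500.
print(upper_tail_normal(10000, 900, 1500, 152))
```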

To formally test whether it could have happened by chance, you'd compare that upper tail probability with your favorite significance level.

This tells you the probability of a result at least as weird as he got if he wasn't copying you. To compute the actual probability you ask about:

"Hmm... There are 10,000 'things' to choose from, and I assume (for simplicity!) that each one has an equal chance of being chosen. I've picked 900 'things'. My friend picked 1500 things... and he picked 890 things in common with me. How likely is it that my friend peeked at my 'picking' history, and copied me?"

- which flips the conditioning around - would require taking a Bayesian approach, which means you need a prior probability that he copied you.
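To make the flip concrete, here is a minimal Python sketch of Bayes' rule for the two hypotheses "copied" and "picked at random". Everything fed into it is an assumption: the prior is yours to choose, and the likelihood under copying needs some model of how a copier would behave, which nothing above specifies. The function name `posterior_copied` and the numbers in the example are placeholders, not estimates.

```python
def posterior_copied(prior_copied: float,
                     lik_if_copied: float,
                     lik_if_random: float) -> float:
    """P(copied | observed overlap) by Bayes' rule for two hypotheses.

    lik_if_random would come from the hypergeometric distribution;
    lik_if_copied must come from an assumed model of copying behaviour.
    """
    num = prior_copied * lik_if_copied
    den = num + (1 - prior_copied) * lik_if_random
    return num / den

# Placeholder inputs: a 1% prior and made-up likelihoods.
print(posterior_copied(0.01, 0.5, 1e-6))
```

The point of the sketch is only that a small prior can still give a large posterior when the observed overlap is far more likely under copying than under random selection.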

Using the numbers in your question to look at the upper tail probabilities mentioned before, the result already starts to look a little bit suspicious at more than 152 items in common (with lower.tail=FALSE, phyper gives the probability of strictly more than its first argument):

> phyper(152,900,10000-900,1500,lower.tail=FALSE)
[1] 0.04497392

and quite suspicious at more than 160:

> phyper(160,900,10000-900,1500,lower.tail=FALSE)
[1] 0.007129847
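If you don't have R handy, the same tail probabilities can be cross-checked by summing the hypergeometric terms exactly. This is a minimal Python sketch using only the standard library; the function name `upper_tail` is my own, and it is meant to compute the same quantity as phyper(t, M, Q - M, L, lower.tail=FALSE), i.e. the probability of strictly more than t items in common.

```python
from math import comb

def upper_tail(Q: int, M: int, L: int, t: int) -> float:
    """Exact P(overlap > t) for L random draws without replacement from
    Q objects, M of which were picked earlier."""
    lo = max(t + 1, 0, L - (Q - M))   # smallest overlap counted in the tail
    hi = min(M, L)                    # largest possible overlap
    # Sum exact integer counts first, then divide once at the end.
    numer = sum(comb(M, k) * comb(Q - M, L - k) for k in range(lo, hi + 1))
    return numer / comb(Q, L)

print(upper_tail(10000, 900, 1500, 152))
print(upper_tail(10000, 900, 1500, 160))
```

Working with exact integers and dividing only once avoids the rounding trouble you would get from multiplying many tiny floating-point probabilities.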

And you can certainly rule out random selection producing results like that well before it gets as high as 890.

That doesn't automatically imply that your friend copied you - perhaps there's some other source of nonrandomness. It just says 'this didn't just happen by chance'.

For any particular set of numbers, the probability that 'the larger includes the smaller' is just the extreme-tail hypergeometric probability, i.e. the probability that the overlap equals the size of the smaller sample. This will generally be very small, unless there are hardly any items in the smaller sample.
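Writing s for the smaller sample size and b for the larger one, that extreme-tail probability simplifies to C(Q - s, b - s) / C(Q, b): once the smaller set is fixed, the larger set must contain it and fill its remaining b - s slots from the other Q - s objects. A minimal Python sketch, again assuming uniform sampling without replacement (the name `containment_prob` and the example numbers are illustrative only):

```python
from fractions import Fraction
from math import comb

def containment_prob(Q: int, M: int, L: int) -> Fraction:
    """P(the larger of the two samples contains the smaller one)."""
    s, b = min(M, L), max(M, L)
    # The larger set must include all s objects and choose its other
    # b - s members from the Q - s objects outside the smaller set.
    return Fraction(comb(Q - s, b - s), comb(Q, b))

# Illustrative numbers only: 10 objects, samples of size 4 and 3.
print(containment_prob(10, 4, 3))
```

The result stays an exact fraction on purpose: with numbers like those in the question the probability is far too small to represent as an ordinary floating-point value.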
