The probability of a sampled election to be wrong

combinatorics, probability, voting-theory

Suppose that only a subset of the people who want to vote is allowed to. The sampling of voters is fair: everyone has the same chance of being selected to vote.

Suppose $n$ people are allowed to vote, sampled from a very large population $P$ of interested voters such that $|P| \gg n$. There are only two candidates. For a given election result, what is the probability that it is correct? That is, what is the probability that the result would be the same if everybody were allowed to vote?

I wrote a Python simulator which says that (using $n = 10000$) if one candidate wins with 52% of the votes, the chance of the election being right is 99.98%. It gets better if the margin is bigger, and worse if the margin is smaller. This is the output of one execution:

50 % : 7076 / 10181 , ( 69.50201355466064% )
51 % : 18932 / 19755 , ( 95.83396608453556% )
52 % : 20104 / 20108 , ( 99.98010741993237% )
53 % : 19913 / 19913 , ( 100.0% )
54 % : 20101 / 20101 , ( 100.0% )
  ... (lines omitted, all the ~20000 executions for each result nailed with 100%)
99 % : 19576 / 19576 , ( 100.0% )
100 % : 10191 / 10191 , ( 100.0% )
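The simulator itself is not shown. The following is only a minimal sketch of one way such an experiment could be set up, not the original code. It assumes that the true support for a hypothetical candidate A is drawn uniformly at random for each trial, that the winner's share is rounded to the nearest percent, and that a tied sample counts as a loss for A. The function name `simulate` and its parameters are made up for illustration.

```python
import random
from collections import defaultdict


def simulate(n: int, trials: int, seed: int = 0) -> dict[int, tuple[int, int]]:
    """Return {winner's percent: (correct elections, total elections)}."""
    rng = random.Random(seed)
    tally = defaultdict(lambda: [0, 0])  # percent -> [correct, total]
    for _ in range(trials):
        # Assumption: true support for candidate A is uniform on [0, 1).
        p = rng.random()
        # Each of the n sampled voters independently supports A with probability p
        # (this treats the population as effectively infinite).
        votes_a = sum(rng.random() < p for _ in range(n))
        a_wins_sample = 2 * votes_a > n  # a tie is arbitrarily counted as a loss for A
        a_wins_population = p > 0.5
        share = max(votes_a, n - votes_a) / n  # winner's share of the sample
        bucket = tally[round(100 * share)]
        bucket[0] += a_wins_sample == a_wins_population
        bucket[1] += 1
    return {pct: (c, t) for pct, (c, t) in sorted(tally.items())}


# Small run for illustration; the figures above used n = 10000 and many more trials.
for pct, (correct, total) in simulate(n=1000, trials=1000).items():
    print(pct, "% :", correct, "/", total)
```

A pure-Python loop like this is slow for $n = 10000$; a vectorised binomial sampler would be the natural replacement for larger runs.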

I am pretty sure there is a formula to calculate it: for a given $n$ and result, what is the probability the result is correct?

What about considering $|P|$ as well?

EDIT:
It was implicit, but I want to make it explicit: every voter supports one or the other candidate, thus will not abstain.

Best Answer

After briefly reviewing sampling distributions, as K.defaoite suggested, I dismissed the idea because I got the feeling that it either was not going the right way to answer my question, or the road to get there would be too long. But it pointed me to the binomial distribution, which pointed to the hypergeometric distribution, and after some thinking, I believe I got the right answer. I would very much appreciate it if you could verify it.

I know I framed the question in terms of $n$ and the victory margin $m$ (the winner's share of the votes), but it is easier to answer considering the number of votes in favor (call it $x$) and the number of votes against (call it $y$), so: $$ x = n \cdot m \\ y = n - x \\ x \ge y $$

A voter drawn at random from $P$ has a probability $p$ of supporting the winner. This means that the result is correct only if $p > 0.5$; otherwise the other candidate would have won if everybody had been allowed to vote.

We don't know the probability $p$, but we know, from the election result, that out of $n$ voters drawn at random, $x$ voters support the winner and $y$ voters do not.

Let $f_{x,y}(p)$ be the probability of a given $p$ yielding the known $x$ and $y$ election result. Among all possible values of $p$, those with the greatest $f_{x,y}(p)$ are the most likely to be the real proportion of voters in $P$ that support the winner.

If we assume $f_{x,y}(p)$ is proportional to the likelihood of $p$ being the real support for the winner (and I think it is, I am just not sure how to prove it), then the probability of the election being wrong is given by:

$$ l(x,y) = \frac{\int_{0}^{0.5}f_{x,y}(p)\, dp}{\int_{0}^{1}f_{x,y}(p)\, dp} $$
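One way to justify the proportionality assumption is Bayes' theorem, under the additional assumption (not stated in the original argument) that every value of $p$ is equally plausible before the election, i.e. a uniform prior $\pi(p) = 1$ on $[0, 1]$. The posterior density of $p$ given the result is then

$$ \pi(p \mid x, y) = \frac{f_{x,y}(p)\,\pi(p)}{\int_{0}^{1} f_{x,y}(q)\,\pi(q)\, dq} = \frac{f_{x,y}(p)}{\int_{0}^{1} f_{x,y}(q)\, dq} $$

and integrating it over $[0, 0.5]$ gives exactly $l(x,y)$. With a different prior, the weights $\pi(p)$ would not cancel and the result would change.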

Which is to say, the more cases of $p < 0.5$ are capable of producing the known result of $x$ and $y$, the bigger the chance of the election result being wrong, because it is only correct if $p > 0.5$.

Now we only need to know $f_{x,y}(p)$ to be able to calculate $l(x,y)$.

The case for $|P| \gg n $:

It won't make much of a difference to the proportion of supporters if 10 or 10,000 voters are removed out of 300,000,000. Thus, for these cases, we can approximate the chance of each one of the $n$ voters being a supporter of the winner as independent of the others, and use the binomial distribution, in which case $f_{x,y}(p)$ is given by:

$$ f_{x,y}(p) = \frac{(x+y)!}{x!y!} p^x (1-p)^y $$

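By plugging this into the definition of $l(x,y)$ (the binomial coefficient does not depend on $p$, so it cancels between numerator and denominator), we get: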

$$ l(x,y) = \frac{\int_{0}^{0.5} p^x (1-p)^y\, dp}{\int_{0}^{1} p^x (1-p)^y \, dp} $$

The numerator bears a striking resemblance to the incomplete beta function, and the denominator to the complete beta function, so much so that it can be written as: $$ l(x,y) = \frac{B(0.5; x+1,y+1)}{B(x+1,y+1)} $$
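For reference, these are the standard definitions being matched, which also explain why the arguments are shifted by one:

$$ B(z; a, b) = \int_{0}^{z} t^{a-1} (1-t)^{b-1}\, dt, \qquad B(a, b) = B(1; a, b) $$

Setting $a - 1 = x$ and $b - 1 = y$ turns the integrand into $p^x (1-p)^y$, hence $a = x+1$ and $b = y+1$.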

This ratio is exactly the definition of the regularized incomplete beta function, so it can be written as: $$ l(x,y) = I_{0.5}(x+1, y+1) $$

It turns out that the regularized incomplete beta function is the CDF of the beta distribution, so I think we can say that, given the result of a sampled election, the level of support for the winner in the population follows a beta distribution with parameters $x+1$ and $y+1$.

In practice, $l(x,y) = 0.5$ for $x = y$, and it quickly drops to 0 as both the difference between $x$ and $y$ and their magnitudes increase.
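To evaluate $l(x,y) = I_{0.5}(x+1, y+1)$ numerically, here is a minimal sketch using only the Python standard library. It relies on the identity, valid for non-negative integers $x$ and $y$, that $I_{1/2}(x+1, y+1)$ equals the probability that a $\text{Binomial}(x+y+1, 1/2)$ variable is at least $x+1$. The function name `prob_wrong` is made up for illustration. If SciPy is available, `scipy.special.betainc(x + 1, y + 1, 0.5)` should give the same value.

```python
from fractions import Fraction
from math import comb


def prob_wrong(x: int, y: int) -> float:
    """l(x, y) = I_{1/2}(x + 1, y + 1) for integer vote counts x, y >= 0."""
    m = x + y + 1
    # Upper tail of Binomial(m, 1/2): number of outcomes with at least x + 1 successes.
    tail = sum(comb(m, k) for k in range(x + 1, m + 1))
    # Exact rational arithmetic avoids overflow/underflow for large vote counts.
    return float(Fraction(tail, 2 ** m))


print(prob_wrong(5, 5))        # tied sample
print(prob_wrong(6, 4))        # 60% of 10 votes
print(prob_wrong(5200, 4800))  # 52% of 10000 votes
```

The same 60% share is far more convincing with more voters: `prob_wrong(60, 40)` is much smaller than `prob_wrong(6, 4)`.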

The case for small $|P|$

If you want to consider the statistical dependency between draws from $P$, you can instead use the PMF of the hypergeometric distribution as $f_{x,y}(p)$:

$$ f_{x,y}(p) = \frac{\binom{p|P|}{x} \binom{(1-p)|P|}{y}}{\binom{|P|}{n}} $$

Notice that this function is now discrete, and is only defined for values of $p$ where $p|P| \in \mathbb{N}$. This means that you'll need to replace the integrals with summations over every valid discrete value of $p$ within the integration interval.
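The summation can be sketched as follows, again with the standard library only. This is an illustration under stated assumptions, not a verified reference implementation: every possible number of supporters $k = p|P|$ from $0$ to $|P|$ is weighted equally (a uniform prior), and a population that is split exactly in half counts as "not correct", following the rule that the result is correct only if $p > 0.5$. The names `prob_wrong_finite` and `pop` are made up for illustration.

```python
from fractions import Fraction
from math import comb


def prob_wrong_finite(x: int, y: int, pop: int) -> Fraction:
    """Discrete l(x, y) for a population of size pop, using hypergeometric weights."""
    # f_{x,y}(k / pop) is proportional to C(k, x) * C(pop - k, y);
    # the common denominator C(pop, x + y) cancels in the ratio.
    # math.comb(a, b) returns 0 when b > a, so impossible k contribute nothing.
    weights = [comb(k, x) * comb(pop - k, y) for k in range(pop + 1)]
    # Assumption: "wrong" means p <= 0.5, i.e. 2k <= pop (exact ties included).
    wrong = sum(w for k, w in enumerate(weights) if 2 * k <= pop)
    return Fraction(wrong, sum(weights))


print(float(prob_wrong_finite(6, 4, 101)))    # small population
print(float(prob_wrong_finite(6, 4, 2001)))   # larger population
```

For a population much larger than the sample, the value should approach the beta-function result for the same $x$ and $y$; when the whole population votes, the probability of a wrong result is zero.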
