Binomial Distribution – How to Determine Number of Trials Required to Achieve Desired Odds

Tags: binomial-distribution, p-value, r

This question may have been asked before, but I couldn't find it. So, here goes.

From about 3000 data points that can be characterized as "wins" or "losses" (binomial), it turns out that 52.8% of them are "dumb luck" wins. This is my dependent variable.

I also have some additional data that may help in predicting the above wins that could be considered an independent variable.

The question I'm trying to answer is:

If my independent variable can be used to predict 55% wins, how many trials are required (giving me 55% wins) for me to be 99% sure that this wasn't dumb luck?

The following R code is purposely hacky so I can see everything that is happening.

#Run through a set of trial sizes
numtri <- seq(2720, 2840, 1)

#For a 52.8% probability of dumb luck wins, and the current trial size,
#calculate the number of wins at the 99% limit
numwin <- qbinom(p=0.99, size=numtri, prob=0.528)

#Divide the number of wins at the 99% limit by the trial size to
#get the percent wins.
perwin <- numwin/numtri

#Plot the percent wins versus the trial size, looking for the
#"predicted" 55% value. See the plot below.
plot(numtri, perwin, type="b", main="Looking for 0.55")
grid()

#Draw a red line at the 55% level to show which trial sizes are correct
abline(h=0.55, lwd=2, col="red")

#More than one answer? Why?........Integer issues
head(numtri)
head(numwin)
head(perwin)

From the graph, the answer is: 2740 <= numtri <= 2820

As you can guess, I'm also looking for the required number of trials for 56% wins, 57% wins, 58% wins, etc. So, I'll be automating the process.

Back to my question. Does anyone see a problem with the code, and if not, does anyone have a way to cleanly sneak up on the right and left edges of the "answer"?
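One way to sneak up on the edges without eyeballing the graph is to scan a deliberately wide grid and pick off the first and last trial sizes where the step function straddles the target. A sketch reusing the same inputs (the 2000:4000 grid bounds are my own assumption, chosen wide enough to contain the band):

```r
# Scan a wide grid of trial sizes; the quantile fraction decreases toward
# 0.528 overall but jitters because qbinom returns integers.
numtri <- 2000:4000
perwin <- qbinom(p = 0.99, size = numtri, prob = 0.528) / numtri

left  <- min(numtri[perwin <= 0.55])  # first size whose quantile dips to 55% or below
right <- max(numtri[perwin >= 0.55])  # last size whose quantile still reaches 55%
c(left, right)
```

Because of the integer jitter, the crossing is a band rather than a point, so reporting both edges (rather than a single root) matches what the plot shows.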

[Plot: percent wins (perwin) versus trial size (numtri), titled "Looking for 0.55", with a red horizontal line at the 0.55 level]

Edit (02/13/2011) ==================================================

Per whuber's answer below, I now realize that to compare my alternate 55% wins (which came from the data) with my 52.8% "dumb luck" null (which also came from the data), I have to deal with the fuzz from both measured values. In other words, to be "99% sure" (whatever that means), the 1% tail of BOTH proportions needs to be compared.

For me to get comfortable with whuber's formula for N, I had to reproduce his algebra. In addition, because I know that I'll use this framework for more complicated problems, I bootstrapped the calculations.

Below are some of the results. The upper left graph was produced during the bootstrap process where N (the number of trials) was at 9000. As you can see, the 1% tail for the null extends further into the alternate than the 1% for the alternate. The conclusion? 9000 trials is not enough. The lower left graph is also during the bootstrap process where N was at 13,000. In this case, the 1% tail for the null falls into an area of the alternate that is less than the required 1% value. The conclusion? 13,000 trials is more than enough (too many) for the 1% level.

The upper right graph is where N=11109 (whuber's calculated value) and the 1% tail of the null extends the right amount into the alternate, aligning the 1% tails. The conclusion? 11109 trials are required for the 1% significance level (I am now comfortable with whuber's formula).

The lower right graph is where I used whuber's formula for N and varied the alternate q (at both the 1% and 5% significance levels) so I could have a reference of what might be an "acceptable zone" for alternates. In addition, I overlaid some actual alternates versus their associated number of data points (the blue dots). One point falls well below the 5% level, and even though it might provide 72% "wins", there simply aren't enough data points to differentiate it from 52.8% "dumb luck". So, I'll probably drop that point and the two others below 5% (and the independent variables that were used to select those points).
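The lower-right curve can be reproduced directly from whuber's formula for N by sweeping the alternate q at each significance level. A sketch (the q grid and plotting choices are my own assumptions):

```r
# Required N as a function of the alternate win rate q, for two
# significance levels, with the "sureness" beta fixed at 99%.
p <- 0.528
beta <- 0.99
q <- seq(0.54, 0.75, by = 0.005)

req.N <- function(alpha) {
  ceiling(((qnorm(1 - alpha) * sqrt(p * (1 - p)) +
            qnorm(beta) * sqrt(q * (1 - q))) / (q - p))^2)
}

plot(q, req.N(0.01), type = "l", log = "y",
     xlab = "alternate q", ylab = "required number of trials N")
lines(q, req.N(0.05), lty = 2)
legend("topright", legend = c("1% level", "5% level"), lty = 1:2)
```

Plotting actual alternates as points on these axes then shows at a glance which ones fall below the 5% curve and should be dropped.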

[Plots: bootstrapped null and alternate distributions at N = 9000 (upper left) and N = 13,000 (lower left); aligned 1% tails at N = 11109 (upper right); required N versus alternate q at the 1% and 5% significance levels, with actual alternates overlaid as blue dots (lower right)]

Best Answer

I wonder what it means to be "99% sure."

The code seems to equate "dumb luck" with $p$ = 52.8% probability of wins. Let's imagine conducting $N$ trials, during which we observe $k$ wins. Suppose, for instance, $N$ = 1000 and you observe $k$ = 530 wins. That's greater than the expected number $p N$ = 528, but it's so close that a sceptic would argue it could be due to chance: "dumb luck." So, you need a test. The best test uses a threshold $t$ (which depends, obviously, on $N$ and $p$ and will be a bit greater than $N p$). The chance--still assuming $p$ = 52.8%--of having more than $t$ wins needs to be so low that if you do observe more than $t$ wins, both you and a reasonable sceptic will allow that $p$ is probably greater than 52.8%. This chance is usually called $\alpha$, the "false positive rate." Your first task is to set this threshold. How will you do it? That is, what value of $\alpha$ do you want to use? Does "99% sure" mean that $1 - \alpha$ should equal 99%?

The second task is to consider your chances of beating the threshold if in fact $p$ equals 55% (or whatever you choose). Does "99% sure" mean you want to have a 99% chance of having more than $t$ wins in $N$ trials when $p$ = 55%?

So, it seems to me there are two chances involved here: one in determining the test to distinguish "dumb luck" and another one that will subsequently determine how many trials you need. If this analysis correctly reflects the situation, the problem is easily solved. Note a few things:

  1. There are plenty of uncertainties about this procedure and some of them are not captured by the statistical model. Therefore we shouldn't aim for too much precision. There's really no difference between, say, 2742 and 2749 trials.

  2. In case it turns out hundreds or thousands of trials are needed, the (usual) Normal approximation to the Binomial distribution will be highly accurate. Point #1 justifies using an approximation.

  3. Consequently, there is a simple formula to compute the number of trials directly: no need to explore hundreds of possibilities!

The formula alluded to in #3 is this: in $N$ trials with probability $p$ of winning, the standard error of the proportion of wins is

$$s = \sqrt{p(1-p)/N}.$$

The threshold therefore should be

$$t(N) = N(p + z_\alpha s)$$

where $z_\alpha$ is the upper $100\alpha$ percentage point (equivalently, the $100(1-\alpha)$ percentile) of the standard normal distribution. If the true probability of a win actually is $q$ (such as $q$ = 55%), the chance of exceeding $t(N)$ in $N$ trials equals

$$\Phi\left(\frac{(q-t(N)/N)\sqrt{N}}{\sqrt{q(1-q)}}\right) = \Phi\left(\frac{(q-p)\sqrt{N} - z_\alpha \sqrt{p(1-p)}}{\sqrt{q(1-q)}}\right) $$

where $\Phi$ denotes the standard normal cumulative distribution function. Equating this with your desired degree of certainty (such as $\beta$ = 99%) and solving this equation for $N$ (rounding up to the nearest integer) gives the solution

$$N = \left(\frac{z_\alpha \sqrt{p(1-p)} + z_\beta \sqrt{q(1-q)}}{q-p}\right)^2$$

(assuming $q \gt p$).
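Spelling out the solving step: setting the argument of $\Phi$ equal to $z_\beta = \Phi^{-1}(\beta)$ gives

$$\frac{(q-p)\sqrt{N} - z_\alpha \sqrt{p(1-p)}}{\sqrt{q(1-q)}} = z_\beta
\quad\Longrightarrow\quad
\sqrt{N} = \frac{z_\alpha \sqrt{p(1-p)} + z_\beta \sqrt{q(1-q)}}{q-p},$$

and squaring yields the formula for $N$.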


Example

For $p$ = 52.8%, $q$ = 55%, $\alpha$ = 1%, and $\beta$ = 99%, which give $z_\alpha$ = $z_\beta$ = 2.32635, I obtain $N$ = 11109 with this formula. In 11109 trials you would expect to win about 5866 times by "dumb luck" and maybe as many as $t$ = 5988 times due to chance variation. If the true winning percentage is 55%, you expect to win about 6110 times, but there's a 1% chance you will win 5988 or fewer times, and thereby (due to "bad luck") not succeed in showing (with 99% confidence) that "dumb luck" isn't operating.

R code for the example

#
# Specify inputs.
#
p <- 52.8/100     # Null hypothesis
q <- 55/100       # Alternative
alpha <- 1/100    # Test size
beta <- 99/100    # Chance of confirming the alternative
#
# Compute number of trials.
#
s.p <- sqrt(p*(1-p))
s.q <- sqrt(q*(1-q))
z.alpha <- qnorm(1 - alpha)
z.beta <- qnorm(beta)
delta <- q - p
N <- ceiling(((z.alpha * s.p + z.beta * s.q)/delta)^2)
#
# Explain.
#
t <- floor(N*(p + z.alpha*s.p))    # Threshold
m <- round(N*p)                    # Expectation (null)
k <- round(N*q)                    # Expectation (alternate)
size <- 1 - pnorm((t + 1/2 - N*p)/(s.p * sqrt(N))) # Actual test size
sureness <- 1 - pnorm((t + 1/2 - N*q)/(s.q * sqrt(N)))
print(list(N=N, threshold=t, expectation=m, alternate=k,
  size=size, sureness=sureness))
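As a sanity check on the normal approximation, the same numbers can be compared with the exact binomial quantile and CDF (a sketch reusing the inputs above):

```r
# Exact-binomial check of the closed-form N: get the 99% threshold from
# the binomial quantile under the null, then the exact chance of
# clearing it when the true win rate is q.
p <- 52.8/100
q <- 55/100
N <- 11109

t.exact <- qbinom(0.99, N, p)          # smallest t with P(wins <= t) >= 99%
power   <- 1 - pbinom(t.exact, N, q)   # chance of exceeding t when q is true
c(threshold = t.exact, power = power)  # power should land near the target 99%
```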