Bruce's answer is great. I'd like to provide another way of interrogating whether the results you've observed are reasonable. It's easy to look at a p-value and think it's "wrong" with respect to our intuitions about the observed data and our model.
It might help to reframe this by thinking about what data our model would generate under the null hypothesis. As whuber pointed out, gender bias in hiring is a complex topic, so I'm referring here to "number of heads", as in the number of coin flips that come up heads. However, in principle the same issues apply to any binomial model, provided its assumptions are met.
First, let's simulate what number of heads we get if we flip 16 coins in a row, and repeat that simulation 10,000 times. What's the distribution of results, and where does 2 lie on that distribution?
set.seed(1)  # for reproducibility
a <- replicate(10000, rbinom(1, size = 16, prob = 0.5))
hist(a,
     breaks = "FD",
     xlab = "Number of heads",
     main = "Histogram of number of heads when n=16, p=0.5"
)
abline(v = 2, lty = "dashed")
2 is present in our simulated data, but at a pretty low frequency, so a p-value of 0.2% seems at least in the right ballpark. Bear in mind we're only doing 10,000 replicates, so there will of course be some sampling error.
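We don't have to rely on simulation alone here: the exact lower-tail probability is available from `pbinom`, and we can check that the simulated frequency agrees with it (the variable `a` below is the same simulation as above):

```r
# Exact P(X <= 2) for X ~ Binomial(16, 0.5)
p_exact <- pbinom(2, size = 16, prob = 0.5)
p_exact  # roughly 0.002, i.e. about 0.2%

# Proportion of simulated replicates at or below 2 --
# should be close to p_exact, up to simulation error
a <- replicate(10000, rbinom(1, size = 16, prob = 0.5))
mean(a <= 2)
```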
Now, let's simulate what number of heads we get when simulating 1150 flips, repeat that process 10,000 times, and visualise the distribution along with your observed value of 350:
b <- replicate(10000, rbinom(1, size = 1150, prob = 0.5))
hist(b,
     breaks = "FD",
     xlab = "Number of heads",
     main = "Histogram of number of heads when n=1150, p=0.5"
)
abline(v = 350, lty = "dashed")
Huh. 350 isn't even visible on the distribution unless we manually adjust the x-axis!
## in fact 350 isn't visible unless we set xlim
hist(b,
     breaks = "FD",
     xlab = "Number of heads",
     xlim = c(340, max(b) * 1.1),
     main = "Histogram of number of heads when n=1150, p=0.5"
)
abline(v = 350, lty = "dashed")
This shows that for a binomial distribution with $p=0.5$ and $n=1150$, $x=350$ is an extraordinarily unlikely result, so an extremely small p-value isn't surprising. In fact, you would need on the order of $10^{40}$ simulations to expect to observe even one value that extreme.
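That claim can be made precise with `pbinom` — no simulation required:

```r
# Exact P(X <= 350) for X ~ Binomial(1150, 0.5)
p_tail <- pbinom(350, size = 1150, prob = 0.5)
p_tail      # astronomically small

# Expected number of simulations needed before one value this extreme appears
1 / p_tail
```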
Best Answer
I assume each subject is given three samples of beer to taste, two of them made according to the current recipe and one according to the new recipe. The subject tries to identify the one that tastes different from the other two.
Your null hypothesis is that the new recipe does not taste different. Suppose $n$ subjects taste independently in this way, and $X$ of them make a correct identification. Then under the null hypothesis $X \sim \mathsf{Binom}(n, 1/3).$
If there are $n = 10$ tasters and $X = 8$ of them make correct identifications, then the P-value of the test is $P(X \ge 8) = 0.0034 < 0.01$ and you have good evidence to reject the null hypothesis at the 1% level, concluding that the new recipe tastes different. (Computation in R statistical software.)
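The computation mentioned above is a one-liner in R; both forms below give the same exact P-value:

```r
# Exact P(X >= 8) for X ~ Binomial(10, 1/3)
pbinom(7, size = 10, prob = 1/3, lower.tail = FALSE)  # about 0.0034

# Equivalently, via the exact binomial test
binom.test(x = 8, n = 10, p = 1/3, alternative = "greater")$p.value
```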
This is an example of an exact binomial test. For larger $n,$ you might use a normal approximation to the binomial distribution to find the P-value. In a more complex design you might use a chi-squared approximation.
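As a sketch of the normal-approximation route, here is a hypothetical larger experiment — say $n = 90$ tasters of whom $x = 40$ answer correctly (these numbers are my own illustration, not from the question):

```r
n <- 90; x <- 40; p0 <- 1/3  # hypothetical larger experiment

# Exact binomial tail P(X >= x)
p_exact <- pbinom(x - 1, size = n, prob = p0, lower.tail = FALSE)

# Normal approximation with continuity correction
z <- (x - 0.5 - n * p0) / sqrt(n * p0 * (1 - p0))
p_approx <- pnorm(z, lower.tail = FALSE)

c(exact = p_exact, approx = p_approx)  # the two should agree closely
```

With $n$ this large the approximation tracks the exact tail well; for $n = 10$ it would be noticeably off, which is why the exact test is preferable there.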
Notes: (1) Experience in the food and beverage industries has shown that it is best to use professional tasters for such tests, especially if $n$ has to be small. Randomly chosen 'beer lovers' may vary greatly in their tasting ability, some of them unable, under test conditions, to make distinctions that may prove important over the long run. (2) If beer samples differ in appearance (color, clarity, etc.) and taste is the important issue, then it may be necessary to blindfold subjects or to serve samples in opaque mugs under dim lighting.