Solved – Proportion Test: Z-test vs bootstrap / simulation – different results

Tags: bootstrap, hypothesis-testing, inference, proportion, z-test

I'm learning hypothesis testing, and going through the following example:

The CEO of a large electric utility claims that 80 percent of his 1,000,000 customers are very satisfied with the service they receive. To test this claim, the local newspaper surveyed 100 customers, using simple random sampling. Among the sampled customers, 73 percent say they are very satisfied. Based on these findings, can we reject the CEO's hypothesis that 80% of the customers are very satisfied? Use a 0.05 level of significance.

I'm getting different results when calculating the p-value with a one-sample z-test compared to a bootstrapping/simulation method in Python.

Z-Test Method:

σ = sqrt[(0.8 * 0.2) / 100] = sqrt(0.0016) = 0.04
z = (p - P) / σ = (0.73 - 0.80) / 0.04 = -1.75

Two-tailed test, so P(z < -1.75) = 0.04 and P(z > 1.75) = 0.04.

Thus, the P-value = 0.04 + 0.04 = 0.08.
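For reference, the same z-test numbers can be checked in Python (a quick sketch using scipy, which my bootstrap code below doesn't use):

```python
from scipy.stats import norm

p_hat, p0, n = 0.73, 0.80, 100
sigma = (p0 * (1 - p0) / n) ** 0.5      # standard error under H0: 0.04
z = (p_hat - p0) / sigma                # -1.75
p_value = 2 * norm.cdf(z)              # two-tailed p-value, approx 0.08
print(z, p_value)
```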

Bootstrapping method (in Python):

The general method is to repeatedly take random samples of size 100 from the population of 1,000,000, of which 80% are satisfied:

repeat 5000 times:
    take a random sample of size 100 from the population (1,000,000, 80% of which are satisfied)
    count the number of satisfied customers in the sample, and append the count to the list satisfied_counts
calculate the number of times that a value of 73 or more extreme (i.e., ≤ 73) occurs, and divide this by the number of items in satisfied_counts

Since it's a two-tailed test, double the result to get the p-value.

With this method, the p-value comes out around 0.11.

Here is the code:

import numpy as np
import pandas as pd

population = np.array(['satisfied']*800000 + ['not satisfied']*200000)   # 80% satisfied (1M population)
num_runs = 5000
sample_size = 100
satisfied_counts = []

for i in range(num_runs):
    # sample 100 customers without replacement
    sample = np.random.choice(population, size=sample_size, replace=False)
    hist = pd.Series(sample).value_counts()
    satisfied_counts.append(hist['satisfied'])

# two-tailed p-value: double the proportion of samples with 73 or fewer satisfied
p_val = sum(i <= 73 for i in satisfied_counts) / len(satisfied_counts) * 2

How come the two results are different? Any help / point in the right direction appreciated!

Best Answer

Probability computations. In your first computation, the normal approximation is not quite correct: first, because it shouldn't be expected to give more than about two digits of accuracy; second, because you do not use a continuity correction.

Binomial: The P-value can be based on $X \sim \mathsf{Binom}(n=100, p=.8),$ so that $P(X \le 73) = 0.0558,$ as computed in R.

pbinom(73, 100, .8)
[1] 0.05583272

It seems natural to me to frame this as a one-sided test of $H_0: p = .8$ against $H_a: p < .8,$ because the CEO would hardly be disappointed by a higher satisfaction rate. In that formulation, the P-value is 0.0558. If you want to test against the two-sided alternative $H_a: p \ne .8,$ then the P-value is 0.1117 (double the P-value for the one-sided test). In either case, the test is not significant at the 5% level.
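For readers working in Python rather than R, the same exact binomial P-value can be computed with scipy (a sketch; not part of the original computation above):

```python
from scipy.stats import binom

# one-sided P-value: P(X <= 73) for X ~ Binom(n=100, p=0.8)
p_one = binom.cdf(73, 100, 0.8)
p_two = 2 * p_one            # two-sided P-value, by doubling
print(p_one, p_two)
```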

Normal approximation: The figure below shows why a normal approximation should use a continuity correction, approximating the P-value of a one-sided test as: $$P(X \le 73) = P(X \le 73.5) = P\left(\frac{X - 80}{\sqrt{.8(.2)(100)}} \le \frac{73.5 - 80}{4}\right)\\ \approx P(Z \le -1.625) = 0.0521,$$ where $Z$ is standard normal.

pnorm(-1.625)
[1] 0.05208128

Then the P-value of a two-sided test is .1042. These values are about as good as you can get with a normal approximation. (If you use printed tables rather than software, you may lose a little more accuracy due to rounding.)
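The continuity-corrected approximation is easy to reproduce in Python as well (a sketch using scipy, parallel to the pnorm call above):

```python
from scipy.stats import norm

# continuity-corrected normal approximation to P(X <= 73):
# standardize 73.5 with mean 80 and SD sqrt(.8 * .2 * 100) = 4
z = (73.5 - 80) / (0.8 * 0.2 * 100) ** 0.5   # -1.625
p_one = norm.cdf(z)                          # approx 0.0521
p_two = 2 * p_one                            # approx 0.1042
print(p_one, p_two)
```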

Without the continuity correction, the area under the curve between 73.0 and 73.5 is ignored; with the correction, the entire probability $P(X = 73)$ is approximated by the area under the curve between 72.5 and 73.5.

[Figure: binomial probabilities for $\mathsf{Binom}(100, .8)$ with the approximating normal curve, illustrating the continuity correction at 73.5]

Monte Carlo approximations. Definitions of 'bootstrapping' seem to differ from one field of application to another. Your Python code very nearly approximates the correct P-value, essentially by sampling repeatedly from the appropriate hypergeometric (approximately binomial) distribution, in a way I would call Monte Carlo approximation rather than bootstrapping. But by any name, the procedure seems to be correct.

Notes: (1) In effect, your code is not much different from the sampling done in the following brief R program. [Your code and mine both have the slight advantage of sampling 100 customers without replacement from among a million. But for such a small sample from such a large population, there is hardly any difference between sampling with replacement (binomial) and without (hypergeometric).]

set.seed(1234)
x = rhyper(5000, 800000, 200000, 100)
mean(x <= 73)
[1] 0.0554
2*sd(x <= 73)/sqrt(5000)
[1] 0.006246138            # approx 95% margin of simulation error is 0.006

With more iterations, as suggested by @MartijnWeterings:

set.seed(4321)
m = 10^6
x = rhyper(m, 800000, 200000, 100)
mean(x <= 73)
[1] 0.05603
2*sd(x <= 73)/sqrt(m)
[1] 0.0004599595
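A Python version of the same Monte Carlo check can draw directly from the hypergeometric distribution instead of building the million-element array, which is much faster (a sketch using numpy; the exact estimate varies with the seed):

```python
import numpy as np

rng = np.random.default_rng(1234)
m = 100_000
# number of satisfied customers in each sample of 100 drawn without
# replacement from 800,000 satisfied / 200,000 not satisfied
x = rng.hypergeometric(800_000, 200_000, 100, size=m)
p_one = np.mean(x <= 73)                  # estimates P(X <= 73), approx 0.056
moe = 2 * np.std(x <= 73) / np.sqrt(m)    # approx 95% margin of simulation error
print(p_one, moe)
```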

(2) Using the hypergeometric distribution, R finds the one-sided P-value to be 0.05582 (compared with the binomial 0.05583 from the beginning of this Answer).

phyper(73, 800000, 200000, 100)
[1] 0.05582364
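The corresponding exact hypergeometric computation in Python (a sketch using scipy.stats.hypergeom, whose cdf takes the quantile, population size, number of successes in the population, and sample size, in that order):

```python
from scipy.stats import hypergeom

# P(X <= 73) for 100 draws without replacement from a population of
# 1,000,000 containing 800,000 satisfied customers
p_one = hypergeom.cdf(73, 1_000_000, 800_000, 100)
print(p_one)    # agrees with phyper to several decimal places
```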