Z-test and Chi squared test producing different p-values

hypothesis-testing, statistical-inference, statistics

I'm doing A/B testing for conversion rate on two groups: group A has 6000 samples, of which 90 are conversions, and group B has 4000 samples, of which 80 are conversions. I want to know whether group B has a statistically significantly higher conversion rate.

I seemingly get a different answer depending on whether I use a Z-test or a Chi squared test with alpha = 0.05. The Z-test returns a p-value of 0.0327, whereas the Chi squared test gives a p-value of 0.058.

The problem originates from https://towardsdatascience.com/the-art-of-a-b-testing-5a10c9bb70a4, and when I try it on my own I get the same values as in the article. The author attempts to explain the discrepancy by saying that the Z-test doesn't take into account that the random variable for the difference of the means is restricted to [-1, 1], but I don't really follow.

I was under the impression that these tests are equivalent for this type of problem, so why do they return different p-values?

Thanks.

Edit: As @BruceET suspected, I was doing a two-sided Chi squared test, which obviously doesn't give the same p-value as the one-sided Z-test (or T-test, to be more accurate) for proportions. As was also pointed out, I wasn't clear about how I was estimating the variances, which was another problem. The method used in the article I followed was Welch's T-test (i.e. a T-test without pooling variances). If I use the "exact" variance = mean*(1-mean)*(1/n_A + 1/n_B), where the mean is over both A and B, the p-value is 0.029, exactly half of that of the Chi squared test. I suspect I'll get something close to it if I use a pooled variance, but I have not tried it.
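
As a rough sketch of the two variance choices mentioned in the edit (the variable names are illustrative and not taken from the article, and a normal approximation is assumed instead of a t distribution, which makes practically no difference at these sample sizes), the unpooled standard error should reproduce a p-value near 0.0327, while the variance based on the overall mean should give about half of the two-sided Chi squared p-value:

```r
# Counts from the question: conversions and group sizes for A and B
conv <- c(90, 80)
size <- c(6000, 4000)
p_hat <- conv / size                      # 0.015 and 0.020

# Unpooled (Welch-style) standard error of the difference p_A - p_B
se_unpooled <- sqrt(sum(p_hat * (1 - p_hat) / size))

# Standard error based on the overall conversion rate (the "exact" variance above)
p_pool <- sum(conv) / sum(size)
se_pooled <- sqrt(p_pool * (1 - p_pool) * sum(1 / size))

# One-sided p-values for H1: p_A < p_B, using the normal approximation
d <- p_hat[1] - p_hat[2]
p_unpooled <- pnorm(d / se_unpooled)
p_pooled <- pnorm(d / se_pooled)

round(c(unpooled = p_unpooled, pooled = p_pooled), 4)
```

The only difference between the two p-values is which estimate of the standard error goes into the denominator of the test statistic.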

Best Answer

I realize that this is not a direct answer to your question. However, using two fundamentally different procedures that I trust, I do not find any conflict in the results. [My guess is that your 'z-test' may be one-sided and your 'chi-squared test' two-sided.]

Data:

conv = c(90,80)
size = c(6000, 4000)
nonc = size - conv
MAT = rbind(conv,nonc)

MAT
     [,1] [,2]
conv   90   80
nonc 5910 3920

One-sided Fisher Exact test:

fisher.test(MAT, alt="less")

         Fisher's Exact Test for Count Data

data:  MAT
p-value = 0.03543
alternative hypothesis: 
  true odds ratio is less than 1
95 percent confidence interval:
 0.00000 0.97505
sample estimates:
odds ratio 
 0.7462279

One-sided test of $p_A = p_B$ against $p_A < p_B:$

prop.test(conv, size, alt="less")

        2-sample test for equality of proportions 
        with continuity correction

data:  conv out of size
X-squared = 3.2975, df = 1, p-value = 0.03469
alternative hypothesis: less
95 percent confidence interval:
 -1.0000000000 -0.0003285328    # Does not incl 0

sample estimates:
prop 1 prop 2 
 0.015  0.020

Two-sided chi-squared contingency test. (This test is inherently two-sided, so it is not directly relevant to the one-sided question you say you want answered.)

chisq.test(MAT, cor=F)

        Pearson's Chi-squared test

data:  MAT
X-squared = 3.5904, df = 1, p-value = 0.05811
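
To see how this two-sided result relates to the one-sided question, one can drop the continuity correction from prop.test so that it is comparable to the chisq.test call above. The following is only a sketch under that assumption; it re-creates the data so that it can be run on its own:

```r
conv <- c(90, 80)
size <- c(6000, 4000)
MAT <- rbind(conv, nonc = size - conv)

# Two-sided tests without continuity correction: same statistic, same p-value
two_sided_chisq <- chisq.test(MAT, correct = FALSE)
two_sided_prop <- prop.test(conv, size, correct = FALSE)

# One-sided version (H1: p_A < p_B), also without continuity correction
one_sided_prop <- prop.test(conv, size, alternative = "less", correct = FALSE)

c(chisq = two_sided_chisq$p.value,
  prop_two_sided = two_sided_prop$p.value,
  prop_one_sided = one_sided_prop$p.value)
```

Because the observed difference lies in the direction of the alternative, the one-sided p-value should come out as half of the two-sided one.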

Related Solutions

[Math] Chi-squared goodness-of-fit test whether the data follows binomial distribution

Perhaps it is time for a more complete answer.

You have values $x = (0,1,2,3,4,5)$ with corresponding observed frequencies $f = (20,75,145,140,85,35),$ totaling $m = \sum_x f_x = 500.$ Relative frequencies are $r_x = f_x/m.$

Thus the sample mean is $\bar X = \sum_x xr_x = 2.6,$ which is the estimate of the binomial mean $\mu = n\theta,$ where $n = 5$ is the number of independent trials and $\theta$ is the success probability. Because the estimated mean is $\bar X = \hat \mu = n\hat \theta = 2.6,$ we have $\hat \theta = 2.6/5 = 0.52.$

x = 0:5; f = c(20,75,145,140,85,35) 
sum(f) 
## 500 
r = f/sum(f);  sum(x*f)/500 
## 2.6 
p = 2.6/5;  p 
## 0.52

Our null hypothesis is that the distribution $Binom(5, 0.52)$ is an appropriate model for the number of successes. Under this distribution the probabilities $p_x$ of the values $x$ are given as pdf in the R output below. Under this binomial model, the expected frequencies are $E_x = 500p_x,$ also shown below.

pdf = dbinom(x, 5, .52);  E = 500*pdf
cbind(x, f, r, pdf, E)
##     x   f    r       pdf         E
##     0  20 0.04 0.0254804  12.74020
##     1  75 0.15 0.1380188  69.00941
##     2 145 0.29 0.2990408 149.52038
##     3 140 0.28 0.3239608 161.98042
##     4  85 0.17 0.1754788  87.73939
##     5  35 0.07 0.0380204  19.01020

The chi-squared goodness-of-fit (GOF) statistic is $$Q = \sum_x \frac{(f_x - E_x)^2}{E_x},$$ which is approximately distributed as $Chisq(df = 6-2).$ The approximation is valid because all of the $E_x > 5.$ If we had been given the specific binomial parameters $n$ and $\theta,$ the degrees of freedom would have been $6 - 1,$ but we have estimated one parameter $\theta$ so $df = 6-2 = 4.$

Good agreement ('fit') between the observed frequencies $f_x$ and the expected frequencies $E_x$ gives small values of $Q.$ (Very unlikely perfect agreement would give $Q = 0.$) We reject the model $Binom(5, 0.52)$ for large values of $Q$.

The 95th percentile of $Chisq(df = 4)$ is 9.49, so we reject at the 5% level of significance if $Q > 9.49.$ For our data, the value of the test statistic is $Q = 21.31$, so we reject the model as unreasonable at the 5% level. The P-value 0.0003 is the probability $P(Q > 21.31),$ if the true distribution were $Q \sim Chisq(df = 4).$ The chances of seeing the observed frequencies $f_x$ if the true model were $Binom(5, 0.52)$ are very small. [Looking at the earlier table, we see one major discrepancy is that we have many fewer (140) than the expected number (161.98) of families with three boys.]

Q = sum((f-E)^2/E); Q 
## 21.31109 
qchisq(.95, 6-2) 
## 9.487729
1 - pchisq(21.3, 4)
## 0.0002761148
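
As a cross-check, and only as a sketch, the same statistic can be obtained from R's built-in chisq.test by passing the fitted probabilities. Note that chisq.test does not know that $\theta$ was estimated from the data, so it reports $df = 5$; the P-value therefore has to be recomputed by hand with $df = 4$:

```r
x <- 0:5
f <- c(20, 75, 145, 140, 85, 35)
theta_hat <- sum(x * f) / sum(f) / 5      # estimated success probability (0.52)
pdf <- dbinom(x, 5, theta_hat)

gof <- chisq.test(f, p = pdf)             # reports df = 5 (assumes no parameter was estimated)
Q <- unname(gof$statistic)

# The correct reference distribution loses one more df for the estimated theta
p_value <- pchisq(Q, df = length(f) - 2, lower.tail = FALSE)
c(Q = Q, df_reported = unname(gof$parameter), p_value = p_value)
```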

One way to visualize the poor fit is to compare the probabilities $p_x$ (bars) from the hypothetical model with the relative frequencies ($\times$'s) $r_x$ actually observed. But such graphical comparisons do not reveal the total number of cases observed, and so they should only be viewed along with results from formal GOF tests.
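
A plot of this kind could be produced roughly as follows. This is a sketch using base R graphics; the colors, labels, and title are arbitrary choices, not taken from the original figure:

```r
x <- 0:5
f <- c(20, 75, 145, 140, 85, 35)
r <- f / sum(f)                 # observed relative frequencies
pdf <- dbinom(x, 5, 0.52)       # fitted binomial probabilities

# Vertical bars for the model, crosses for the observed relative frequencies
plot(x, pdf, type = "h", lwd = 3, ylim = c(0, max(pdf, r)),
     xlab = "x", ylab = "Probability",
     main = "Binom(5, 0.52) vs. observed relative frequencies")
points(x, r, pch = 4, col = "red")
```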

Using t-test to compare bias of means between uniform distributions

It makes sense to calculate the variance because what matters here is the variance of the averages rather than the variance of the individual data points. The probability distribution of the averages converges to a normal distribution as N grows (this is the Central Limit Theorem).
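
A small simulation can illustrate this point. The sketch below assumes Uniform(0, 1) data and a sample size of 30, both chosen purely for illustration and not taken from the original question:

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
N <- 30                          # observations per sample (illustrative)
reps <- 10000                    # number of simulated samples

# Average of N Uniform(0, 1) draws, repeated many times
means <- replicate(reps, mean(runif(N)))

# Theory: single draws have variance 1/12, so averages have variance 1/(12 * N)
c(simulated_var = var(means), theoretical_var = 1 / (12 * N))

# Normal quantile plot of the averages; points near the line suggest approximate normality
qqnorm(means)
qqline(means)
```

The simulated variance of the averages should be close to the theoretical value, and much smaller than the variance 1/12 of a single observation.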
