A/B Testing – Comparing Z-Test, T-Test, Chi-Square, and Fisher Exact Test

chi-squared-test, fishers-exact-test, p-value, statistical-significance, z-statistic

I'm trying to understand the reasoning behind choosing a specific test approach when dealing with a simple A/B test, i.e. two variations/groups with a binary response (converted or not). As an example I will be using the data below:

Version  Visits  Conversions
A        2069     188
B        1826     220

The top answer here is great and talks about some of the underlying assumptions for the z-, t-, and chi-square tests. But what I find confusing is that different online resources cite different approaches, and you would think the assumptions for a basic A/B test should be pretty much the same?

  1. For instance, this article uses a z-score (the usual pooled form is written out after this list for comparison): [formula image]

  2. This article uses the following formula (which I'm not sure is different from the z-score calculation?): [formula image]

  3. This paper references the t-test (p. 152): [formula image]
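For reference, the usual pooled two-proportion z-statistic (which may or may not be exactly what those articles show) is

$$z = \frac{\hat p_A - \hat p_B}{\sqrt{\hat p\,(1-\hat p)\left(\frac{1}{n_A}+\frac{1}{n_B}\right)}}, \qquad \hat p = \frac{x_A + x_B}{n_A + n_B},$$

where $x_A, x_B$ are the conversion counts and $n_A, n_B$ the visit counts.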

So what arguments can be made in favor of these different approaches? Why would one have a preference?

To throw in one more candidate, the table above can be rewritten as a 2×2 contingency table, where Fisher's exact test (p. 5) can be used:

              Non converters  Converters  Row Total
Version A     1881            188         2069  
Version B     1606            220         1826
Column Total  3487            408         3895

But according to this thread, Fisher's exact test should only be used with smaller sample sizes (what's the cutoff?).

And then there are paired t- and z-tests, the F-test (and logistic regression, but I want to leave that out for now)… I feel like I'm drowning in different test approaches, and I just want to be able to make some kind of argument for the different methods in this simple A/B test case.

Using the example data I'm getting the following p-values

  1. https://vwo.com/ab-split-test-significance-calculator/ gives a p-value of 0.001 (z-score)

  2. http://www.evanmiller.org/ab-testing/chi-squared.html (using the chi-square test) gives a p-value of 0.00259

  3. And in R, fisher.test(rbind(c(1881,188),c(1606,220)))$p.value gives a p-value of 0.002785305

Which I guess are all pretty close…
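For what it's worth, these numbers can be reproduced in R straight from the counts above (just a sketch; prop.test without continuity correction is equivalent to the pooled z-test, and the 0.001 from VWO looks like it could be a one-sided version of the same calculation):

    # Conversions and visits per version, from the table in the question
    conv   <- c(A = 188, B = 220)
    visits <- c(A = 2069, B = 1826)

    # Test of equal proportions; with correct = FALSE the X-squared statistic
    # is the squared pooled z-statistic, so this matches the z-test result
    prop.test(conv, visits, correct = FALSE)$p.value

    # Chi-square test on the 2x2 contingency table (no continuity correction)
    tab <- rbind(A = c(1881, 188), B = c(1606, 220))
    chisq.test(tab, correct = FALSE)$p.value

    # Fisher's exact test, as quoted above
    fisher.test(tab)$p.value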

Anyway – just hoping for some healthy discussion on what approaches to use in online testing, where sample sizes are usually in the thousands and response ratios are often 10% or less. My gut is telling me to use chi-square, but I want to be able to answer exactly why I'm choosing it over the multitude of other ways to do it.

Best Answer

We use these tests for different reasons and under different circumstances.

  1. $z$-test. A $z$-test assumes that our observations are independently drawn from a Normal distribution with unknown mean and known variance. A $z$-test is used primarily when we have quantitative data (e.g. weights of rodents, ages of individuals, systolic blood pressure, etc.). However, $z$-tests can also be used when we are interested in proportions (e.g. the proportion of people who get at least eight hours of sleep, etc.).

  2. $t$-test. A $t$-test assumes that our observations are independently drawn from a Normal distribution with unknown mean and unknown variance. Note that with a $t$-test, we do not know the population variance. This is far more common than knowing the population variance, so a $t$-test is generally more appropriate than a $z$-test, but practically there will be little difference between the two if sample sizes are large.

With $z$- and $t$-tests, your alternative hypothesis will be that the population mean (or population proportion) of one group is not equal to, less than, or greater than the population mean (or proportion) of the other group. Which of these you pick depends on the type of analysis you seek to do, but your null and alternative hypotheses directly compare the means/proportions of the two groups.
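To make this concrete with the example counts, here is a minimal R sketch, assuming the pooled form of the two-proportion z-test and treating each visit as a 0/1 observation for the t-test:

    # Example counts from the question
    xA <- 188; nA <- 2069   # version A: conversions, visits
    xB <- 220; nB <- 1826   # version B: conversions, visits

    # Pooled two-proportion z-test (two-sided)
    pA <- xA / nA
    pB <- xB / nB
    p  <- (xA + xB) / (nA + nB)                        # pooled proportion under H0
    z  <- (pB - pA) / sqrt(p * (1 - p) * (1/nA + 1/nB))
    2 * pnorm(-abs(z))                                 # two-sided p-value

    # Welch t-test on the raw 0/1 responses (population variance unknown)
    a <- rep(c(1, 0), times = c(xA, nA - xA))
    b <- rep(c(1, 0), times = c(xB, nB - xB))
    t.test(a, b)$p.value

With samples this large the two p-values should agree closely, which is the point above about the $z$- and $t$-tests converging.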

  3. Chi-squared test. Whereas $z$- and $t$-tests concern quantitative data (or proportions in the case of $z$), chi-squared tests are appropriate for qualitative data. Again, the assumption is that observations are independent of one another. In this case, you aren't seeking a particular relationship. Your null hypothesis is that no relationship exists between variable one and variable two. Your alternative hypothesis is that a relationship does exist. This doesn't give you specifics as to how this relationship exists (i.e. in which direction the relationship goes), but it will provide evidence that a relationship does (or does not) exist between your independent variable and your groups.

  4. Fisher's exact test. One drawback to the chi-squared test is that it is asymptotic. This means that the $p$-value is only guaranteed to be accurate for large sample sizes. If your sample sizes are small, then the $p$-value may not be quite as accurate. As such, Fisher's exact test allows you to calculate the $p$-value of your data exactly, rather than relying on approximations that will be poor if your sample sizes are small. (A quick R comparison is sketched just below.)
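Here is that comparison as a minimal sketch (the large table is the one from the question; the tiny table is invented purely to illustrate the small-sample case):

    # Large 2x2 table from the question: approximation and exact test nearly agree
    big <- rbind(A = c(1881, 188), B = c(1606, 220))
    chisq.test(big, correct = FALSE)$p.value
    fisher.test(big)$p.value

    # Tiny hypothetical table: chisq.test warns that the approximation may be
    # inaccurate, and the exact p-value can differ noticeably
    small <- rbind(A = c(8, 2), B = c(3, 7))
    chisq.test(small, correct = FALSE)$p.value   # asymptotic approximation
    fisher.test(small)$p.value                   # exact p-value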

I keep discussing sample sizes: different references will give you different rules of thumb as to when your samples are large enough. I would just find a reputable source, look at their rule, and apply their rule to find the test you want. I would not "shop around", so to speak, until you find a rule that you "like".

Ultimately, the test you choose should be based on a) your sample size and b) what form you want your hypotheses to take. If you are looking for a specific effect from your A/B test (for example, my B group has higher test scores), then I would opt for a $z$-test or $t$-test, depending on sample size and whether the population variance is known. If you want to show that a relationship merely exists (for example, my A group and B group differ based on the independent variable, but I don't care which group has higher scores), then the chi-squared or Fisher's exact test is appropriate, depending on sample size.

Does this make sense? Hope this helps!