I'm trying to understand the reasoning behind choosing a specific test approach when dealing with a simple A/B test (i.e. two variations/groups with a binary response: converted or not). As an example I will be using the data below:
| Version | Visits | Conversions |
|---------|--------|-------------|
| A       | 2069   | 188         |
| B       | 1826   | 220         |
The top answer here is great and talks about some of the underlying assumptions for $z$-, $t$- and chi-squared tests. But what I find confusing is that different online resources cite different approaches, even though you would think the assumptions for a basic A/B test should be pretty much the same.
- For instance, this article uses z-score:
- This article uses the following formula (which I'm not sure is any different from the z-score calculation):
- This paper references the t-test (p. 152):
So what arguments can be made in favor of these different approaches? Why would one have a preference?
To throw in one more candidate, the table above can be rewritten as a 2×2 contingency table, where Fisher's exact test (p. 5) can be used:

|              | Non-converters | Converters | Row total |
|--------------|----------------|------------|-----------|
| Version A    | 1881           | 188        | 2069      |
| Version B    | 1606           | 220        | 1826      |
| Column total | 3487           | 408        | 3895      |
But according to this thread, Fisher's exact test should only be used with smaller sample sizes (what's the cutoff?).
And then there are paired t- and z-tests, the F-test (and logistic regression, but I want to leave that out for now). I feel like I'm drowning in different test approaches, and I just want to be able to make some kind of argument for the different methods in this simple A/B test case.
Using the example data I'm getting the following p-values:
- https://vwo.com/ab-split-test-significance-calculator/ gives a p-value of 0.001 (z-score)
- http://www.evanmiller.org/ab-testing/chi-squared.html (using chi-square test) gives a p-value of 0.00259
- And in R, `fisher.test(rbind(c(1881,188),c(1606,220)))$p.value` gives a p-value of 0.002785305
Which I guess are all pretty close…
Anyway – just hoping for some healthy discussion on what approaches to use in online testing where sample sizes are usually in the thousands, and response ratios are often 10% or less. My gut is telling me to use chi-square, but I want to be able to answer exactly why I'm choosing it over the other multitude of ways to do it.
Best Answer
We use these tests for different reasons and under different circumstances.
$z$-test. A $z$-test assumes that our observations are independently drawn from a Normal distribution with unknown mean and known variance. A $z$-test is used primarily when we have quantitative data (e.g. weights of rodents, ages of individuals, systolic blood pressure). However, $z$-tests can also be used when we are interested in proportions (e.g. the proportion of people who get at least eight hours of sleep).
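To make the proportions case concrete, here is a minimal Python sketch of a two-proportion $z$-test on the counts from the question. The function name `two_proportion_z_test` and the choice of a pooled standard error are my own assumptions for illustration; I have not checked that the calculators linked in the question compute it this way. It uses only the standard library.

```python
import math

def two_proportion_z_test(x_a, n_a, x_b, n_b):
    """Pooled two-proportion z-test (hypothetical helper, normal approximation)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    # Under H0 both groups share one conversion rate, estimated by pooling.
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal tail areas via the complementary error function.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # H1: rate of B > rate of A
    return z, p_two_sided, p_one_sided

z, p_two, p_one = two_proportion_z_test(188, 2069, 220, 1826)
print(f"z = {z:.3f}, two-sided p = {p_two:.5f}, one-sided p = {p_one:.5f}")
```

On these counts the statistic comes out near 3.01, with a two-sided $p$-value near 0.0026 and a one-sided $p$-value near 0.0013. If the calculator that reported 0.001 is giving a one-sided value, that would explain the gap, but that is a guess on my part.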
$t$-test. A $t$-test assumes that our observations are independently drawn from a Normal distribution with unknown mean and unknown variance. Note that with a $t$-test, we do not know the population variance. This is far more common than knowing the population variance, so a $t$-test is generally more appropriate than a $z$-test, but practically there will be little difference between the two if sample sizes are large.
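To see how little changes at these sample sizes, here is a rough Python sketch that treats each visit as a 0/1 outcome and computes Welch's unequal-variance $t$ statistic from the counts in the question. Treating conversions this way and using Welch's form are my assumptions, not something the question's sources prescribe, and the helper name `welch_t_from_counts` is hypothetical. It stops at the statistic and degrees of freedom because the standard library has no $t$ distribution.

```python
import math

def welch_t_from_counts(x_a, n_a, x_b, n_b):
    """Welch t statistic and degrees of freedom for 0/1 outcomes (hypothetical helper)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    # Sample variance of 0/1 data with Bessel's correction: n/(n-1) * p * (1-p).
    var_a = p_a * (1 - p_a) * n_a / (n_a - 1)
    var_b = p_b * (1 - p_b) * n_b / (n_b - 1)
    se2_a, se2_b = var_a / n_a, var_b / n_b
    t = (p_b - p_a) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

t, df = welch_t_from_counts(188, 2069, 220, 1826)
print(f"t = {t:.3f}, df = {df:.0f}")
```

With thousands of degrees of freedom the $t$ reference distribution is practically indistinguishable from the Normal. The statistic here (about 2.99) sits close to the pooled $z$ statistic on the same counts (about 3.01), and the small gap comes from pooled versus unpooled variance rather than from $t$ versus $z$.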
With $z$- and $t$-tests, your alternative hypothesis will be that your population mean (or population proportion) of one group is either not equal, less than, or greater than the population mean (or proportion) of the other group. This will depend on the type of analysis you seek to do, but your null and alternative hypotheses directly compare the means/proportions of the two groups.
Chi-squared test. Whereas $z$- and $t$-tests concern quantitative data (or proportions in the case of $z$), chi-squared tests are appropriate for qualitative data. Again, the assumption is that observations are independent of one another. In this case, you aren't seeking a particular relationship. Your null hypothesis is that no relationship exists between variable one and variable two. Your alternative hypothesis is that a relationship does exist. This doesn't give you specifics as to how this relationship exists (i.e. in which direction the relationship goes), but it will tell you whether there is evidence of a relationship between your two variables (in an A/B test, between the version shown and whether the visitor converted).
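For a 2×2 table this can be sketched in a few lines of Python. The helper name `chi_squared_2x2` is hypothetical, and I am assuming Pearson's statistic without a continuity correction (R's `chisq.test` applies Yates' correction to 2×2 tables by default, so it would give a slightly different number). With one degree of freedom the $p$-value can be taken from the Normal tail, so only the standard library is needed.

```python
import math

def chi_squared_2x2(table):
    """Pearson chi-squared test of independence, no continuity correction (hypothetical helper)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if version and conversion were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # A chi-squared variable with 1 df is a squared standard normal,
    # so P(X > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: version A, version B; columns: non-converters, converters.
stat, p = chi_squared_2x2([[1881, 188], [1606, 220]])
print(f"chi-squared = {stat:.3f}, p = {p:.5f}")
```

On the question's table this gives a statistic of roughly 9.07 and a $p$-value of roughly 0.0026, in line with the 0.00259 quoted in the question. For a 2×2 table the uncorrected statistic is the square of the pooled two-proportion $z$ statistic, so the chi-squared test and the two-sided $z$-test give the same $p$-value. The practical difference is that the $z$-test can also be run one-sided.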
Fisher's exact test. One drawback to the chi-squared test is that it is asymptotic. This means that its $p$-value is an approximation that becomes accurate only as sample sizes grow large. If your sample sizes are small, then the $p$-value may not be quite as accurate. Fisher's exact test lets you calculate the $p$-value of your data exactly instead of relying on an approximation that will be poor if your sample sizes are small.
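To show what "exact" means here, this is a small Python sketch of the two-sided test for a 2×2 table. It enumerates every table with the same margins and adds up the hypergeometric probabilities of those no more likely than the observed one. That two-sided rule is, as far as I know, the convention R's `fisher.test` follows, but treat it as an assumption. The helper name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test for a 2x2 table (hypothetical helper)."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    # Range of the top-left cell that keeps all four cells non-negative.
    lo, hi = max(0, col1 - row2), min(row1, col1)

    def weight(k):
        # Unnormalised hypergeometric probability of a table with top-left cell k.
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    extreme = 0
    for k in range(lo, hi + 1):
        w = weight(k)
        if w <= observed:  # exact integer comparison, no rounding involved
            extreme += w
    # Normalise once at the end; Python integers keep everything exact until here.
    return extreme / comb(row1 + row2, col1)

p = fisher_exact_two_sided([[1881, 188], [1606, 220]])
print(f"Fisher exact two-sided p = {p:.6f}")
```

The question reports 0.002785305 from R for this table, and this sketch should land close to that. Note that nothing stops it from running on a sample of nearly 4,000 visits: the usual advice to reserve Fisher's test for small samples reflects that the chi-squared approximation is already good at this size, not that the exact test stops being valid.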
I keep discussing sample sizes - different references will give you different metrics as to when your samples are large enough. I would just find a reputable source, look at their rule, and apply their rule to find the test you want. I would not "shop around", so to speak, until you find a rule that you "like".
Ultimately, the test you choose should be based on a) your sample size and b) what form you want your hypotheses to take. If you are looking for a specific effect from your A/B test (for example, my B group has higher test scores), then I would opt for a $z$-test or $t$-test, depending on sample size and whether the population variance is known. If you want to show that a relationship merely exists (for example, my A group and B group are different based on the independent variable but I don't care which group has higher scores), then the chi-squared or Fisher's exact test is appropriate, depending on sample size.
Does this make sense? Hope this helps!