Hypothesis Testing in AB Tests – Aggregation-Level Strategies for Accurate Results

ab-test, aggregation, hypothesis-testing

Hello fellow number crunchers

I hope this is a valid question for this forum. I am a lonesome quarter-statistician and have trouble finding someone to ask.

Introduction:

The AB-Test has become really popular since it is so easy to implement and execute. Additionally, the web is flooded with blogs explaining how to determine the significance of the results. All in all, there seems to be little discussion about controlling or excluding possible "influential" variables (on the other hand, controlling such variables is quite hard on the web).

Most AB-Tests compare the outcomes of both groups by simply counting how many clicks or conversions each group has generated. Then a binomial distribution is assumed for each group, and statistical tests are performed to see which group has the greater success probability $p$.
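For the record, that standard comparison can be sketched in a few lines. The counts below (120/1000 vs. 90/1000 conversions) are made up for illustration, and the pooled two-proportion z-test is written out by hand so that nothing beyond the standard library is needed:

```python
from math import sqrt, erf

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference of two binomial proportions."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled proportion under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: group A (blue) vs. group B (red)
z, p = two_proportion_ztest(120, 1000, 90, 1000)
print(round(z, 3), round(p, 3))
```

With these invented numbers the difference comes out significant at the usual 5% level, which is exactly the kind of result those blog posts walk you through.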

So the question is:

Is it "better" to compare the outcome of both groups without any aggregation or it is "better" to aggregate e.g. on daily basis ?

Example:
The AB-Test is to check whether a landing page creates more newsletter subscriptions (<- conversions in this case). The AB-Test is deployed/online the whole test-time. In group A the landing page's main color is blue, in group B the main color is red. Assume 2000 visitors per day, i.e. each group gets roughly 1000 visitors. In this case "without aggregation" means, that I get 1000 datapoints per group per day meanwhile "aggregation on daily basis" means, that I get one (!) datapoint per group per day.

Discussion

The latter (aggregation on a daily basis) would allow the pairing of values, which in turn can capture daily effects like peaks in user behavior and preferences, but it extends the duration of AB-Tests, because it takes longer to collect enough datapoints. On the other hand, the former (no aggregation) seems to be the strategy of the majority, because … I don't know, maybe because a) with enough traffic you can make nearly anything significant within one day, and b) it is the easiest thing to do.
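A minimal sketch of the paired variant, using hypothetical daily conversion rates for a one-week test (all numbers invented for illustration). The point of pairing is that the test works with the within-day differences, so a shock that lifts both groups on the same day cancels out instead of inflating the variance:

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical daily conversion rates over a 7-day test (one datapoint
# per group per day -- the "aggregation on a daily basis" strategy).
rate_a = [0.12, 0.10, 0.13, 0.11, 0.12, 0.14, 0.11]  # blue
rate_b = [0.09, 0.08, 0.11, 0.09, 0.10, 0.12, 0.09]  # red

# Paired t statistic on the within-day differences.
diffs = [a - b for a, b in zip(rate_a, rate_b)]
t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
print(round(t, 2))  # compare against a t-distribution with n-1 = 6 df
```

Note how small the denominator is here: the day-to-day ups and downs are shared by both groups and drop out of the differences, which is exactly the variance-reduction argument for pairing.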

One (possibly influencing) example to stimulate your thoughts:
Assume that on one day during the test the "National Blue Day" is celebrated, so the color blue, together with positive emotions, is visible in all the media. On this day group A generates far more conversions than group B.

This difference clearly affects the test: if aggregated on a daily basis, it increases the variance; if not aggregated at all, it either vanishes in the sea of data (when multiple days are collected without aggregation) or it leads to the wrong conclusion (if that day is the first and only day of the test).

Another example: Assume that the landing page belongs to a vegetable company. One day the "National Vegetable Day" is celebrated, and now everyone wants to subscribe to the newsletter, no matter what the color is. This short-term effect is captured by aggregation on a daily basis and a paired test, but it increases the variance in the case of no aggregation (because no paired test can be performed here; is this even correct?)

All in all: Am I on the right track, or am I missing something completely?

Best Answer

If the treatment is randomly assigned, the aggregation won't matter in determining the effect of the treatment (or the average treatment effect). I use lowercase in the following examples to refer to disaggregated items and uppercase to refer to aggregated items. Let's a priori state a model of individual decision making, where $y$ is the outcome of interest and $x$ represents whether an observation received the treatment:

$y = \alpha + b_1(x) + b_2(z) + e$

When one aggregates, one is simply summing random variables, so one would observe:

$\sum y = \sum\alpha + \beta_1(\sum x) + \beta_2(\sum z) + \sum e$

So why should $\beta_1$ (which works out to the sum of the unit-level coefficients divided by their number, $n$) equal $b_1$? Because, by the nature of random assignment, all of the individual components of $x$ are orthogonal (i.e. the variance of $\sum x$ is simply the sum of the individual variances), and all of the individual components are uncorrelated with any of the $z$'s or $e$'s in the above equation.
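This claim can be checked with a small simulation, assuming the linear model above with a made-up treatment effect $b_1 = 2$ and random assignment. Regressing the daily sums $\sum y$ on $\sum x$ recovers (up to sampling noise) the same coefficient as the unit-level regression:

```python
import random
from statistics import mean

random.seed(0)

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

# Hypothetical individual-level model: y = alpha + b1*x + e,
# with b1 = 2 and x randomly assigned (the A/B split).
B1, DAYS, N_PER_DAY = 2.0, 200, 1000
x_all, y_all, X_agg, Y_agg = [], [], [], []
for _ in range(DAYS):
    x_day = [random.randint(0, 1) for _ in range(N_PER_DAY)]
    y_day = [0.5 + B1 * xi + random.gauss(0, 1) for xi in x_day]
    x_all += x_day
    y_all += y_day
    X_agg.append(sum(x_day))  # daily sum of treatment indicators
    Y_agg.append(sum(y_day))  # daily sum of outcomes

b_disagg = ols_slope(x_all, y_all)  # unit-level estimate of b1
b_agg = ols_slope(X_agg, Y_agg)     # daily-sum estimate of b1
print(round(b_disagg, 2), round(b_agg, 2))
```

Both estimates land near 2; the aggregated one is noisier simply because it rests on 200 datapoints instead of 200,000, which mirrors the questioner's point about aggregation extending test duration.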

Perhaps an example of summing two random variables will be more informative. Say we aggregate two random variables from the first equation presented. What we observe is:

$(y_i + y_j) = (\alpha_i + \alpha_j) + \beta_1(x_i + x_j) + \beta_2(z_i + z_j) + (e_i + e_j)$

This can subsequently be broken down into its individual components:

$(y_i + y_j) = \alpha_i + \alpha_j + b_1(x_i) + b_2(x_j) + b_3(z_i) + b_4(z_j) + e_i + e_j$

By the nature of random assignment, we expect $x_i$ and $x_j$ in the above statement to be independent of all the other terms ($z_i$, $z_j$, $e_i$, etc.) and of each other. Hence the effect estimated on the aggregated data equals the effect estimated on the disaggregated data (here, $\beta_1$ equals the sum of $b_1$ and $b_2$ divided by two).

This exercise is informative, though, in showing where aggregation bias comes into play. Any time the components of the aggregated variable are not independent of the other components, you are creating an inherent confound in the analysis (i.e. you cannot independently identify the effects of each individual item). So, going with your "blue day" scenario, one might have a model of individual behavior:

$y = \alpha + b_1(x) + \beta_2(Z) + b_3(x*Z) + e$

where $Z$ refers to whether the observation was taken on blue day and $x*Z$ is the interaction of the treatment with it being blue day. It should be fairly obvious why this is problematic if you take all of your observations on one day: if treatment is randomly assigned, $b_1(x)$ and $\beta_2(Z)$ should be independent, but $b_1(x)$ and $b_3(x*Z)$ are not. Hence you will not be able to uniquely identify $b_1$, and the research design is inherently confounded.
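A quick simulation, again with invented coefficients, makes the confound visible. When every observation falls on blue day, the naive blue-vs-red comparison estimates $b_1 + b_3$ rather than $b_1$:

```python
import random
from statistics import mean

random.seed(1)

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

# Hypothetical model: y = alpha + b1*x + beta2*Z + b3*(x*Z) + e,
# with b1 = 1 (the "true" color effect) and b3 = 2 (extra blue-day boost).
B1, BETA2, B3, N = 1.0, 0.5, 2.0, 20000

def run_test(z_value):
    """Run the whole test with every observation taken on a day where Z = z_value."""
    x = [random.randint(0, 1) for _ in range(N)]
    y = [0.2 + B1 * xi + BETA2 * z_value + B3 * xi * z_value + random.gauss(0, 1)
         for xi in x]
    return ols_slope(x, y)

print(round(run_test(0), 2))  # ordinary day: recovers roughly b1 = 1
print(round(run_test(1), 2))  # everything on blue day: roughly b1 + b3 = 3
```

The comparison itself is still internally valid on blue day (blue did beat red), but the estimate no longer generalizes to ordinary days, which is the confound described above.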

You could potentially make a case for doing the analysis on the aggregated items (aggregated values tend to be easier to work with, less noisy, and easier to model distributionally). But if the real question is to identify $b_1(x)$, then the research design should be structured to identify it appropriately. While I made an argument above for why aggregation does not matter in a randomized experiment, in many settings the assumption that all of the individual components are independent is violated. If you expect specific effects on specific days, aggregating the observations will not help you identify the treatment effect (it is actually a good argument for prolonging the observation period to make sure no inherent confounds are present).
