Solved – Complex experimental design for A/B testing

ab-test, bootstrap, experiment-design, independence

I am facing the following situation. I am running a web service for which I want to do A/B testing. The two versions A and B differ in the formatting, ordering, and content of the web service's answer to a user request.

The criterion I compute is the click-through rate in each version, that is, the number of clicks on request results divided by the number of requests made.

However, the main issue is that I do not have access to any end-user/cookie id and thus cannot directly map users to versions (A or B). I can only map requests (randomly or based on their features) to versions.

This means that the assumption of perfectly randomized assignment of end users to versions, traditionally made in the A/B testing framework, does not hold here.
Furthermore, some requests in version A are dependent on requests in version B, and vice versa.

My questions

  1. Is it meaningful to use the bootstrap to compute confidence intervals for the difference in click-through rates?
  2. Given that I cannot identify end users, how could I change my experimental setting to improve it and make it easier to interpret?

My first guesses:

I cannot compute any asymptotic confidence interval using the Central Limit Theorem, as independence between requests does not hold (two requests may come from the same user and be dependent).
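For reference, the asymptotic interval in question is the standard Wald confidence interval for a difference of two proportions, which is only valid under the independence assumption being doubted here. A minimal sketch (the function name and the click/request counts are illustrative, not from the original post):

```python
import math

def wald_ci_diff(clicks_a, n_a, clicks_b, n_b, z=1.96):
    """Asymptotic 95% CI for CTR(B) - CTR(A), assuming each request is
    an independent Bernoulli trial -- exactly the assumption in doubt."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# hypothetical counts: 120/1000 clicks in A, 150/1000 in B
lo, hi = wald_ci_diff(120, 1000, 150, 1000)
```

If requests were dependent, the standard error above would understate the true variance, making this interval too narrow.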

I cannot use the bootstrap either, because resampling the two versions independently implicitly assumes that they are independent…

Best Answer

No, you cannot use the classical bootstrap in a dependent context (longer explanation here). When bootstrapping, you resample observations from version A independently of observations from version B. You thus act as if the Bernoulli variables in each version were independent.
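To make the implicit assumption concrete, here is a sketch of the classical (i.i.d.) percentile bootstrap for the CTR difference; the simulated click indicators and function name are illustrative. Every resampling step below treats requests as exchangeable and the two versions as independent:

```python
import numpy as np

rng = np.random.default_rng(0)

def naive_bootstrap_ci(a, b, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for CTR(B) - CTR(A).
    Each version is resampled independently, with replacement --
    which is exactly the independence assumption being questioned."""
    a, b = np.asarray(a), np.asarray(b)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        ra = rng.choice(a, size=a.size, replace=True)
        rb = rng.choice(b, size=b.size, replace=True)
        diffs[i] = rb.mean() - ra.mean()
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

# hypothetical 0/1 click indicators, one per request
a = rng.binomial(1, 0.12, size=1000)
b = rng.binomial(1, 0.15, size=1000)
lo, hi = naive_bootstrap_ci(a, b)
```

If requests from the same user appear in both arrays, the resampled datasets no longer reflect the true joint distribution, and the resulting interval is unreliable.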

There exist several bootstrap methods for dependent variables. However, they assume or exploit a known dependence structure in the data (weakly dependent time series, for example). The block bootstrap is one popular method for this. See this presentation and Wikipedia.
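As a sketch of what the block bootstrap does, and why it needs structural knowledge: it resamples overlapping blocks of consecutive observations so that short-range dependence within a block is preserved. The block length must be chosen from knowledge of the correlation structure, which is precisely what is missing in this problem. The function name and parameters here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def moving_block_bootstrap_mean(x, block_len, n_boot=2000):
    """Moving block bootstrap of the sample mean: draw blocks of
    `block_len` consecutive observations with replacement, concatenate
    them, and recompute the mean. Dependence within a block survives
    the resampling; dependence across blocks does not."""
    x = np.asarray(x)
    n = x.size
    n_blocks = int(np.ceil(n / block_len))
    starts = np.arange(n - block_len + 1)  # all valid block start indices
    means = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.choice(starts, size=n_blocks, replace=True)
        sample = np.concatenate([x[s:s + block_len] for s in idx])[:n]
        means[i] = sample.mean()
    return means

# hypothetical 0/1 click indicators ordered in time
x = rng.binomial(1, 0.1, size=1000)
means = moving_block_bootstrap_mean(x, block_len=20)
```

Note that this requires the observations to be ordered so that nearby observations are the dependent ones; without user ids there is no such ordering here, so the blocks cannot be defined meaningfully.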

Here you cannot make use of this type of information: you have little knowledge of the correlation structure of your data and, most importantly, you cannot even tell which observation comes from which user, so you cannot define the blocks.

Your next step could be to validate your hypothesis experimentally: the dependence between your observations may be weak enough that the Central Limit Theorem and the bootstrap still apply approximately. Run multiple A/A tests and check whether the true value (zero) is contained in the asymptotic confidence intervals at the nominal confidence level.
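The A/A validation above can be sketched as a coverage check. The simulation below uses i.i.d. Bernoulli draws as a stand-in for real traffic (so it shows the best case, coverage near the nominal 95%); in practice you would replace the simulated arrays with actual A/A data and flag coverage well below nominal as evidence that the independence assumption is violated. Names and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def aa_coverage(n_tests=500, n=2000, p=0.1, z=1.96):
    """Run repeated A/A tests and report how often the asymptotic CI
    for the CTR difference contains the true value, zero. With iid
    data this should be close to 0.95; with dependent real traffic,
    substantially lower coverage signals that the CLT-based interval
    is too narrow."""
    hits = 0
    for _ in range(n_tests):
        a = rng.binomial(1, p, size=n)  # replace with real A/A data
        b = rng.binomial(1, p, size=n)
        pa, pb = a.mean(), b.mean()
        se = np.sqrt(pa * (1 - pa) / n + pb * (1 - pb) / n)
        d = pb - pa
        if d - z * se <= 0 <= d + z * se:
            hits += 1
    return hits / n_tests

cov = aa_coverage()
```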
