Solved – Complex experimental design for A/B testing

ab-test, bootstrap, experiment-design, independence

I am facing the following situation. I am running a web service for which I want to do A/B testing. The two versions A and B differ in the formatting, ordering, and content of the web service's answer to a user request.

The criterion I compute is the click-through rate in each version, that is, the number of clicks on request results divided by the number of requests made.

However, the main issue is that I do not have access to any end-user/cookie id and thus cannot directly map users to versions (A or B). I can only map requests (randomly or based on their features) to versions.

This means that the assumption of perfectly randomized assignment of end users to versions, traditionally made in the A/B testing framework, does not hold here.
Furthermore, some requests in version A are dependent on requests in version B, and vice versa.

My questions

  1. Is it meaningful to use the bootstrap to compute confidence intervals for the difference in click-through rates?
  2. Given that I cannot identify end users, how could I change my experimental setting to improve it and make it easier to interpret?

My first guesses:

I cannot compute any asymptotic confidence interval using the Central Limit Theorem, as independence between requests does not hold (two requests may come from the same user and be dependent).
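For reference, the asymptotic interval in question is the standard Wald confidence interval for a difference of two proportions, which is only valid under the independence assumption being doubted here. A minimal sketch (the function name and the click/request counts are illustrative, not from the original post):

```python
import math

def wald_ci_diff(clicks_a, n_a, clicks_b, n_b, z=1.96):
    """Asymptotic 95% CI for CTR(B) - CTR(A), assuming each request is
    an independent Bernoulli trial -- exactly the assumption in doubt."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# hypothetical counts: 120/1000 clicks in A, 150/1000 in B
lo, hi = wald_ci_diff(120, 1000, 150, 1000)
```

If requests were dependent, the standard error above would understate the true variance, making this interval too narrow.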

I cannot use the bootstrap either, because resampling the two versions independently implicitly assumes that they are independent…

Best Answer

No, you cannot use the classical bootstrap in a dependent context (longer explanation here). When bootstrapping, you resample observations from version A independently of observations from version B. You thus act as if the Bernoulli variables in each version were independent.
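To make the implicit assumption concrete, here is a sketch of the classical (i.i.d.) percentile bootstrap for the CTR difference; the simulated click indicators and function name are illustrative. Every resampling step below treats requests as exchangeable and the two versions as independent:

```python
import numpy as np

rng = np.random.default_rng(0)

def naive_bootstrap_ci(a, b, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for CTR(B) - CTR(A).
    Each version is resampled independently, with replacement --
    which is exactly the independence assumption being questioned."""
    a, b = np.asarray(a), np.asarray(b)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        ra = rng.choice(a, size=a.size, replace=True)
        rb = rng.choice(b, size=b.size, replace=True)
        diffs[i] = rb.mean() - ra.mean()
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

# hypothetical 0/1 click indicators, one per request
a = rng.binomial(1, 0.12, size=1000)
b = rng.binomial(1, 0.15, size=1000)
lo, hi = naive_bootstrap_ci(a, b)
```

If requests from the same user appear in both arrays, the resampled datasets no longer reflect the true joint distribution, and the resulting interval is unreliable.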

There exist several bootstrap methods for dependent variables. However, they assume or exploit a known dependence structure in the data (weakly dependent time series, for example). The block bootstrap is one popular method for this. See this presentation and Wikipedia.
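As a sketch of what the block bootstrap does, and why it needs structural knowledge: it resamples overlapping blocks of consecutive observations so that short-range dependence within a block is preserved. The block length must be chosen from knowledge of the correlation structure, which is precisely what is missing in this problem. The function name and parameters here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def moving_block_bootstrap_mean(x, block_len, n_boot=2000):
    """Moving block bootstrap of the sample mean: draw blocks of
    `block_len` consecutive observations with replacement, concatenate
    them, and recompute the mean. Dependence within a block survives
    the resampling; dependence across blocks does not."""
    x = np.asarray(x)
    n = x.size
    n_blocks = int(np.ceil(n / block_len))
    starts = np.arange(n - block_len + 1)  # all valid block start indices
    means = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.choice(starts, size=n_blocks, replace=True)
        sample = np.concatenate([x[s:s + block_len] for s in idx])[:n]
        means[i] = sample.mean()
    return means

# hypothetical 0/1 click indicators ordered in time
x = rng.binomial(1, 0.1, size=1000)
means = moving_block_bootstrap_mean(x, block_len=20)
```

Note that this requires the observations to be ordered so that nearby observations are the dependent ones; without user ids there is no such ordering here, so the blocks cannot be defined meaningfully.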

Here you cannot make use of this type of information: you have little knowledge of the correlation structure of your data and, most importantly, you cannot even tell which observation comes from which user, so you cannot define the blocks.

Your next step could be to validate your hypothesis experimentally: the dependence between your observations may be weak enough that the Central Limit Theorem and the bootstrap still apply approximately. Run multiple A/A tests and check whether the true value (zero) is contained in the asymptotic confidence intervals at the nominal confidence level.
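The A/A validation above can be sketched as a coverage check. The simulation below uses i.i.d. Bernoulli draws as a stand-in for real traffic (so it shows the best case, coverage near the nominal 95%); in practice you would replace the simulated arrays with actual A/A data and flag coverage well below nominal as evidence that the independence assumption is violated. Names and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def aa_coverage(n_tests=500, n=2000, p=0.1, z=1.96):
    """Run repeated A/A tests and report how often the asymptotic CI
    for the CTR difference contains the true value, zero. With iid
    data this should be close to 0.95; with dependent real traffic,
    substantially lower coverage signals that the CLT-based interval
    is too narrow."""
    hits = 0
    for _ in range(n_tests):
        a = rng.binomial(1, p, size=n)  # replace with real A/A data
        b = rng.binomial(1, p, size=n)
        pa, pb = a.mean(), b.mean()
        se = np.sqrt(pa * (1 - pa) / n + pb * (1 - pb) / n)
        d = pb - pa
        if d - z * se <= 0 <= d + z * se:
            hits += 1
    return hits / n_tests

cov = aa_coverage()
```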
