Statistical Significance – Bootstrapping for Two-Proportion Test with Grouped Data

Tags: ab-test, bootstrap, statistical-significance

I have a dataset from an AB test for clickthroughs on a website. We randomly divided users into A and B groups and recorded an observation each time a user viewed the webpage. Each observation is one viewing of the webpage, with 1 or 0 in the clicked_through column if the user clicked through or didn't, respectively. A sample dataset (for either the A or B group) is given below. Notice that some users viewed the webpage more than once during the test period.

user_id    clicked_through
-------    ---------------
user_1                   0
user_2                   1
user_2                   0
user_3                   0
user_3                   1
user_3                   0
user_4                   0
user_5                   0
user_5                   1
user_6                   1

We want to test whether the clickthrough rate (CTR) for the B group is greater than the CTR for the A group at a given significance level. I don't think we can use an ordinary two-proportion Z-test because observations for the same user are not independent. I've seen similar questions (such as this one) where the Delta Method and bootstrapping were recommended. I want to use bootstrapping because I want to learn more about bootstrapping generally. However, I'm not sure exactly how to use bootstrapping with this data. Sampling rows with replacement seems wrong because we know the distribution of CTR when drawing a given number of times from the A and B datasets and so we can create confidence intervals analytically, probably replicating the ordinary two-proportion Z-test.

My idea is to randomly choose users from each group (with replacement) to create the bootstrapped datasets. For example, in one iteration we might choose users 1, 2, 3, 2, 6, 3 from the above dataset, giving this bootstrapped dataset:

user_id    clicked_through
-------    ---------------
user_1                   0
user_2                   1
user_2                   0
user_3                   0
user_3                   1
user_3                   0
user_2                   1
user_2                   0
user_6                   1
user_3                   0
user_3                   1
user_3                   0

I would compute CTR for each bootstrapped dataset to create confidence intervals for the A and B datasets, then compare the overlap of the confidence intervals to gauge significance.

Is this approach sound? If not, what's a better approach?

Additional info: Both the A and B datasets have ~12,000 unique users with ~16,000 page views and a ~40% CTR. The max number of views for a single user is 21. Given the relatively low ratio of page views to users, the two-proportion Z-test might not work so badly here, but we have other AB test datasets with far higher ratios so I want to find a bootstrapping approach that will work well generally.

Best Answer

Yes, you describe correctly how to perform the resampling bootstrap when the randomization unit is the user and the analysis unit is the pageview: sample users with replacement.
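For concreteness, here is a sketch of that direct user-level resampling on your sample data (the data frame `df` and the helper logic are my illustration, not code from your pipeline):

```r
# Direct bootstrap: resample *users* (not rows) with replacement,
# then pool all pageviews of the selected users.
set.seed(1)
df <- data.frame(
  user_id = c("u1", "u2", "u2", "u3", "u3", "u3", "u4", "u5", "u5", "u6"),
  clicked_through = c(0, 1, 0, 0, 1, 0, 0, 0, 1, 1)
)
users <- unique(df$user_id)
boot_ctr <- replicate(2000, {
  sampled <- sample(users, length(users), replace = TRUE)
  # table() counts how often each user was drawn;
  # each user's rows enter the bootstrap sample that many times
  w <- table(factor(sampled, levels = users))[df$user_id]
  sum(w * df$clicked_through) / sum(w)
})
quantile(boot_ctr, c(0.025, 0.975))  # percentile confidence interval
```

This is exactly the scheme you describe; the weighted sum is just a compact way of replicating each selected user's rows.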

Yes, there are better (= more computationally efficient) approaches.

Note: By the A/B test design, groups A and B are independent. So we only need to consider one group and show how to define an estimator $\bar{X}$ of its CTR and how to estimate the variance $\mathbb{V}(\bar{X})$ of this estimator. Then we can test whether the click-through rates in groups A and B are the same with the z-statistic $\left(\bar{X}_A - \bar{X}_B\right) / \sqrt{ \mathbb{V}(\bar{X}_A) + \mathbb{V}(\bar{X}_B) }$.
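In code, the final comparison then reduces to a few lines; `z_test_ctr` below is a hypothetical helper I introduce for illustration, taking the per-group estimates and variances produced by either method:

```r
# z-test from per-group CTR estimates and their variances
z_test_ctr <- function(ctr_a, var_a, ctr_b, var_b) {
  z <- (ctr_b - ctr_a) / sqrt(var_a + var_b)
  # one-sided p-value for H1: CTR_B > CTR_A
  p <- pnorm(z, lower.tail = FALSE)
  c(z = z, p = p)
}
z_test_ctr(0.40, 1e-5, 0.41, 1e-5)  # z ≈ 2.24
```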

The Delta method gives a formula for the standard error of the CTR (technically, it derives an asymptotically consistent estimator of the variance). No repeated sampling necessary. [1]

We can use the Delta method to estimate the mean CTR and its variance with time complexity $\mathcal{O}(n)$ and space complexity $\mathcal{O}(1)$, where $n$ is the number of users, by passing through the data once while keeping track of a small number of sample quantities: two means, two variances and one covariance.

The reweighting bootstrap assigns a random weight to the pageviews of each user; it corresponds to the number of times a user is added into a bootstrap sample. What's the trick? If user $i$ is selected $k$ times, its pageviews and click-throughs contribute $k$ times to the totals. Since the totals ignore the sampling order, we can simply multiply user $i$'s statistics by (the weight) $k$. [2]

The reweighting bootstrap (aka block bootstrap) can be implemented almost as efficiently as the Delta method, with time complexity $\mathcal{O}(n)$ and space complexity $\mathcal{O}(b)$, where $b$ is the number of bootstrap replicates. See the simulation below.

[1] S. Deng, R. Longbotham, T. Walker, and Y. Xu. Choice of randomization unit in online controlled experiments. In Joint Statistical Meetings Proceedings, 4866–4877, 2011.

[2] E. Bakshy and D. Eckles. Uncertainty in online experiments with dependent data: An evaluation of bootstrap methods. In Proceedings of KDD'13, 1303–1311, 2013.


Let's demonstrate how to use the Delta method and the reweighting bootstrap in your use case.

Set up the data.

set.seed(1234)

n <- 10000

# Simulate the CTR for n users
user_clt <- rbeta(n, 0.1, 0.5)
# Simulate the pageviews for n users
user_pvs <- rpois(n, 6)
# Simulate the number of click-throughs for n users
user_clicks <- rbinom(n, user_pvs, user_clt)

Apply the Delta method.

# We can compute sample means, variances and covariances with a single pass
# because they can be expressed as (a function of) sample averages
EKi <- mean(user_pvs)
EYi <- mean(user_clicks)
# Var{X} = E{X^2} - E{X}^2
VarKi <- mean(user_pvs^2) - EKi^2
VarYi <- mean(user_clicks^2) - EYi^2
# Cov{X,Y} = E{XY} - E{X}*E{Y}
CovYiKi <- mean(user_clicks * user_pvs) - EKi * EYi

# See page 3 of Deng et al. 2011
ctr_obs <- sum(user_clicks) / sum(user_pvs)
ctr_var <- (1 / (EKi^2) * VarYi + (EYi^2) / (EKi^4) * VarKi - 2 * EYi / (EKi^3) * CovYiKi) / n

c(ctr_obs, ctr_var)
#> [1] 1.669751e-01 1.083011e-05
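Since the Delta method delivers an asymptotically normal point estimate together with its variance, a confidence interval follows directly (plugging in the values printed above):

```r
# 95% normal-approximation CI from the Delta-method estimate and variance
ctr_obs <- 1.669751e-01  # point estimate computed above
ctr_var <- 1.083011e-05  # variance estimate computed above
ci <- ctr_obs + c(-1, 1) * qnorm(0.975) * sqrt(ctr_var)
round(ci, 4)
#> [1] 0.1605 0.1734
```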

Apply the reweighting bootstrap.

# number of bootstrap replicates
b <- 2000

# In this simulation, I materialize the matrix of weights.
# But in practice, we will loop through (the large number of) users once.
weights <- matrix(rpois(n * b, 1), n, b)

clicks_b <- user_clicks %*% weights
pvs_b <- user_pvs %*% weights

ctr_boot <- as.numeric(clicks_b / pvs_b)

c(mean(ctr_boot), var(ctr_boot))
#> [1] 1.669962e-01 1.078134e-05
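The comment above about looping through users once can be made concrete. Because each user's Poisson(1) weight is drawn independently, the $b$ bootstrap totals can be accumulated one user at a time, never materializing the $n \times b$ weight matrix. A sketch (it re-simulates the data, so the draws differ slightly from those above):

```r
set.seed(1234)
n <- 10000
b <- 2000
user_pvs <- rpois(n, 6)
user_clicks <- rbinom(n, user_pvs, rbeta(n, 0.1, 0.5))

# One-pass Poisson bootstrap: O(b) space, no n-by-b weight matrix
clicks_tot <- numeric(b)
pvs_tot <- numeric(b)
for (i in seq_len(n)) {
  w <- rpois(b, 1)  # this user's weight in each bootstrap replicate
  clicks_tot <- clicks_tot + w * user_clicks[i]
  pvs_tot <- pvs_tot + w * user_pvs[i]
}
ctr_boot <- clicks_tot / pvs_tot
c(mean(ctr_boot), var(ctr_boot))
```

In a production setting the loop body is all that has to run per user, so the bootstrap can be computed in the same streaming pass that accumulates the Delta-method moments.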