I have a dataset from an A/B test for clickthroughs on a website. We randomly divided users into A and B groups and recorded an observation each time a user viewed the webpage. Each observation is one viewing of the webpage, with 1 or 0 in the clicked_through column if the user clicked through or didn't, respectively. A sample dataset (for either the A or B group) is given below. Notice that some users viewed the webpage more than once during the test period.
user_id clicked_through
------- ---------------
user_1 0
user_2 1
user_2 0
user_3 0
user_3 1
user_3 0
user_4 0
user_5 0
user_5 1
user_6 1
We want to test whether the clickthrough rate (CTR) for the B group is greater than the CTR for the A group at a given significance level. I don't think we can use an ordinary two-proportion Z-test because observations for the same user are not independent. I've seen similar questions (such as this one) where the Delta Method and bootstrapping were recommended. I want to use bootstrapping because I want to learn more about bootstrapping generally. However, I'm not sure exactly how to use bootstrapping with this data. Sampling rows with replacement seems wrong because we know the distribution of CTR when drawing a given number of times from the A and B datasets and so we can create confidence intervals analytically, probably replicating the ordinary two-proportion Z-test.
My idea is to randomly choose users from each group (with replacement) to create the bootstrapped datasets. For example, in one iteration we might choose users 1, 2, 3, 2, 6, 3 from the above dataset, giving this bootstrapped dataset:
user_id clicked_through
------- ---------------
user_1 0
user_2 1
user_2 0
user_3 0
user_3 1
user_3 0
user_2 1
user_2 0
user_6 1
user_3 0
user_3 1
user_3 0
I would compute CTR for each bootstrapped dataset to create confidence intervals for the A and B datasets, then compare the overlap of the confidence intervals to gauge significance.
Is this approach sound? If not, what's a better approach?
Additional info: Both the A and B datasets have ~12,000 unique users with ~16,000 page views and a ~40% CTR. The max number of views for a single user is 21. Given the relatively low ratio of page views to users, the two-proportion Z-test might not work so badly here, but we have other AB test datasets with far higher ratios so I want to find a bootstrapping approach that will work well generally.
Best Answer
Yes, you correctly describe how to perform the resampling bootstrap when the randomization unit is the user and the analysis unit is the pageview: sample users with replacement.
Yes, there are better (= more computationally efficient) approaches.
Note: By the A/B test design, groups A and B are independent. So we only need to consider one group and show how to define an estimator $\bar{X}$ of its CTR and how to estimate the variance $\mathbb{V}(\bar{X})$ of this estimator. Then we can test whether the click-through rates in groups A and B are the same with the z-statistic $\left(\bar{X}_A - \bar{X}_B\right) / \sqrt{ \mathbb{V}(\bar{X}_A) + \mathbb{V}(\bar{X}_B) }$.
The Delta method gives a formula for the standard error of the CTR (technically, it derives an asymptotically consistent estimator of the variance). No repeated sampling necessary. [1]
We can use the Delta method to estimate the mean CTR and its variance with time complexity $\mathcal{O}(n)$ and space complexity $\mathcal{O}(1)$, where $n$ is the number of users, by passing through the data once while keeping track of a small number of sample quantities: two means, two variances and one covariance.
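For concreteness, here is the usual form of the Delta-method estimator for a ratio of means. The notation is my own assumption, not taken from the references: $N_i$ is the number of pageviews and $Y_i$ the number of click-throughs of user $i$, $\bar{N}$ and $\bar{Y}$ are their sample means over the $n$ users, and $s_N^2$, $s_Y^2$, $s_{YN}$ are the sample variances and covariance.

$$
\bar{X} = \frac{\sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} N_i} = \frac{\bar{Y}}{\bar{N}},
\qquad
\mathbb{V}(\bar{X}) \approx \frac{1}{n \bar{N}^2} \left( s_Y^2 - 2 \bar{X} s_{YN} + \bar{X}^2 s_N^2 \right)
$$

These are exactly the two means, two variances and one covariance mentioned above.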
The reweighting bootstrap assigns a random weight to the pageviews of each user; the weight corresponds to the number of times the user is added to a bootstrap sample. What's the trick? If user $i$ is selected $k$ times, its pageviews and click-throughs contribute $k$ times to the totals. Since the totals ignore the sampling order, we can simply multiply the statistics of user $i$ by the weight $k$. [2]
Reweighting bootstrap (aka block bootstrap) can be implemented almost as efficiently as the Delta method, with time complexity $\mathcal{O}(n)$ and space complexity $\mathcal{O}(b)$, where $b$ is the number of bootstrap replicates. See the simulation below.
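To make the equivalence concrete, here is a small Python sketch using the sample table from the question, aggregated to one (views, clicks) pair per user, and the example draw of users 1, 2, 3, 2, 6, 3. The variable names are mine and purely illustrative.

```python
from collections import Counter

# Per-user (views, clicks) for user_1 .. user_6 from the sample table.
users = [(1, 0), (2, 1), (3, 1), (1, 0), (2, 1), (1, 1)]

# The draw from the question: users 1, 2, 3, 2, 6, 3 (as 0-based indices).
picked = [0, 1, 2, 1, 5, 2]

# Resampling: concatenate the selected users and add up their totals.
views_resampled = sum(users[i][0] for i in picked)
clicks_resampled = sum(users[i][1] for i in picked)

# Reweighting: count how often each user was picked, then multiply.
weights = Counter(picked)  # users that were never picked get weight 0
views_weighted = sum(weights[i] * v for i, (v, _) in enumerate(users))
clicks_weighted = sum(weights[i] * c for i, (_, c) in enumerate(users))

ctr_resampled = clicks_resampled / views_resampled
ctr_weighted = clicks_weighted / views_weighted
print(ctr_resampled, ctr_weighted)
```

Both routes give 5 click-throughs out of 12 pageviews, which matches the bootstrapped dataset written out in the question.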
[1] S. Deng, R. Longbotham, T. Walker, and Y. Xu. Choice of randomization unit in online controlled experiments. In Joint Statistical Meetings Proceedings, 4866–4877, 2011.
[2] E. Bakshy and D. Eckles. Uncertainty in online experiments with dependent data: An evaluation of bootstrap methods. In Proceedings of KDD'13, 1303–1311, 2013.
Let's demonstrate how to use the Delta method and the reweighting bootstrap in your use case.
Set up the data.
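A minimal Python sketch of this step, using only the standard library. The simulator `simulate_pageviews` and its parameters (a Beta-distributed click probability per user, a 25% chance of each further view) are hypothetical choices of mine, made only so that the simulated groups roughly resemble the sizes described in the question; `aggregate_by_user` collapses pageview rows to one (views, clicks) pair per user.

```python
import random
from collections import defaultdict

def simulate_pageviews(n_users, mean_ctr, seed):
    """Simulate (user_id, clicked_through) rows with within-user correlation.

    Hypothetical generator: each user has their own click probability,
    so pageviews from the same user are not independent.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(n_users):
        # User-specific click probability with mean equal to mean_ctr.
        p_user = rng.betavariate(2 * mean_ctr, 2 * (1 - mean_ctr))
        # Most users view once; a few view several times (capped at 21).
        n_views = 1
        while n_views < 21 and rng.random() < 0.25:
            n_views += 1
        for _ in range(n_views):
            rows.append((f"user_{i + 1}", int(rng.random() < p_user)))
    return rows

def aggregate_by_user(rows):
    """Collapse pageview rows to one (views, clicks) pair per user."""
    totals = defaultdict(lambda: [0, 0])
    for user_id, clicked in rows:
        totals[user_id][0] += 1
        totals[user_id][1] += clicked
    return [tuple(pair) for pair in totals.values()]

# The sample table from the question.
sample_rows = [
    ("user_1", 0), ("user_2", 1), ("user_2", 0), ("user_3", 0), ("user_3", 1),
    ("user_3", 0), ("user_4", 0), ("user_5", 0), ("user_5", 1), ("user_6", 1),
]
sample_users = aggregate_by_user(sample_rows)

# Two simulated groups; the CTRs of 0.40 and 0.42 are made-up values.
group_a = aggregate_by_user(simulate_pageviews(12_000, 0.40, seed=1))
group_b = aggregate_by_user(simulate_pageviews(12_000, 0.42, seed=2))
```

Everything downstream only needs the per-user pairs, not the raw rows.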
Apply the Delta method.
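A minimal Python sketch of this step, assuming the data has already been aggregated to one (views, clicks) pair per user. The function names are my own. It makes a single pass over the users and keeps only running sums, from which it derives the two means, two variances and one covariance. The test direction (B minus A, one-sided) follows the question.

```python
import math
from statistics import NormalDist

def delta_method_ctr(users):
    """Return (ctr, variance) from per-user (views, clicks) pairs in one pass."""
    n = 0
    sum_v = sum_c = sum_vv = sum_cc = sum_vc = 0.0
    for views, clicks in users:
        n += 1
        sum_v += views
        sum_c += clicks
        sum_vv += views * views
        sum_cc += clicks * clicks
        sum_vc += views * clicks
    mean_v, mean_c = sum_v / n, sum_c / n
    # Sample variances and covariance of the per-user totals.
    var_v = (sum_vv - n * mean_v ** 2) / (n - 1)
    var_c = (sum_cc - n * mean_c ** 2) / (n - 1)
    cov_vc = (sum_vc - n * mean_v * mean_c) / (n - 1)
    ctr = mean_c / mean_v
    # Delta-method variance of the ratio of two means.
    variance = (var_c - 2 * ctr * cov_vc + ctr ** 2 * var_v) / (n * mean_v ** 2)
    return ctr, variance

def one_sided_z_test(users_a, users_b):
    """z-statistic and p-value for H1: the CTR of B exceeds the CTR of A."""
    ctr_a, var_a = delta_method_ctr(users_a)
    ctr_b, var_b = delta_method_ctr(users_b)
    z = (ctr_b - ctr_a) / math.sqrt(var_a + var_b)
    return z, 1.0 - NormalDist().cdf(z)

# Per-user (views, clicks) for the sample table in the question.
sample_users = [(1, 0), (2, 1), (3, 1), (1, 0), (2, 1), (1, 1)]
ctr, variance = delta_method_ctr(sample_users)
print(f"CTR = {ctr:.3f}, SE = {math.sqrt(variance):.4f}")
```

On the six-user sample this gives a CTR of 0.4 with a standard error of about 0.098. Six users are far too few for the asymptotics to be trustworthy, so treat the sample only as a check on the arithmetic.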
Apply the reweighting bootstrap.
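A minimal Python sketch of this step, standard library only. Two assumptions of mine: the weights are independent Poisson(1) draws, a common streaming approximation to the multinomial counts of an exact resampling bootstrap, and the data comes from a hypothetical simulator (`simulate_users`, with made-up CTRs of 0.40 and 0.42). The bootstrap makes one pass over the users and stores only two running totals per replicate, so memory grows with the number of replicates rather than the number of users.

```python
import math
import random
import statistics
from statistics import NormalDist

def simulate_users(n_users, mean_ctr, seed):
    """Hypothetical per-user (views, clicks) data with within-user correlation."""
    rng = random.Random(seed)
    users = []
    for _ in range(n_users):
        p_user = rng.betavariate(2 * mean_ctr, 2 * (1 - mean_ctr))
        n_views = 1
        while n_views < 21 and rng.random() < 0.25:
            n_views += 1
        clicks = sum(rng.random() < p_user for _ in range(n_views))
        users.append((n_views, clicks))
    return users

def poisson1(rng):
    """Draw from Poisson(1) with Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def reweighting_bootstrap(users, n_replicates, seed):
    """Return one bootstrap CTR per replicate, in a single pass over users."""
    rng = random.Random(seed)
    views = [0] * n_replicates
    clicks = [0] * n_replicates
    for v, c in users:
        for r in range(n_replicates):
            w = poisson1(rng)  # times this user "appears" in replicate r
            views[r] += w * v
            clicks[r] += w * c
    # Skip the (practically impossible) replicate with no pageviews.
    return [c / v for c, v in zip(clicks, views) if v > 0]

group_a = simulate_users(1_000, 0.40, seed=1)
group_b = simulate_users(1_000, 0.42, seed=2)
ctr_a = sum(c for _, c in group_a) / sum(v for v, _ in group_a)
ctr_b = sum(c for _, c in group_b) / sum(v for v, _ in group_b)

boot_a = reweighting_bootstrap(group_a, 200, seed=10)
boot_b = reweighting_bootstrap(group_b, 200, seed=20)
se_a, se_b = statistics.stdev(boot_a), statistics.stdev(boot_b)

# One-sided test of H1: the CTR of B exceeds the CTR of A.
z = (ctr_b - ctr_a) / math.sqrt(se_a ** 2 + se_b ** 2)
p_value = 1.0 - NormalDist().cdf(z)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.4f}")
```

The group sizes and the 200 replicates are kept small so that the pure-Python loop runs quickly; a vectorised implementation would be the natural choice for the real datasets.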