Solved – Hierarchical Bayesian Model (?)

bayesian, binomial-distribution, multilevel-analysis

Please excuse my butchering of statistical lingo 🙂 I have found a couple of questions on here related to advertising and click-through rates, but none of them helped me much with understanding my hierarchical situation.

There's a related question, Are these equivalent representations of the same hierarchical Bayesian model?, but I'm not sure whether it actually deals with a similar problem. Another question, Priors for hierarchial Bayesian binomial model, goes into detail about hyperpriors, but I'm not able to map its solution to my problem.

I have a couple of ads online for a new product. I let the ads run for a couple of days; at that point enough people have clicked on them to see which one gets the most clicks. After kicking out all but the one with the most clicks, I let it run for another couple of days to see how many people actually purchase after clicking on the ad. At that point I know whether it was a good idea to run the ads in the first place.

My statistics are very noisy because I don't have a lot of data: I'm only selling a couple of items every day. That makes it really hard to estimate how many people buy something after seeing an ad; only about one in every 150 clicks results in a purchase.

Generally speaking, I need to know as soon as possible whether I'm losing money on each ad, by somehow smoothing the per-ad statistics with global statistics over all ads.

  • If I wait until every ad has seen enough purchases, I'll go broke because it takes too long: to test 10 ads I need to spend 10 times more money before the statistics for each ad become reliable enough, and by that time I might have lost money.
  • If I average purchases over all of the ads, I won't be able to kick out the ads that simply aren't working as well as the others.

Could I take the global purchase rate ($ per click) and use it as a prior for $N$ per-ad sub-distributions? That would mean that the more data I have for a given ad, the more independent the statistics for that ad become. If nobody has clicked on an ad yet, I assume that the global average is appropriate.

Which distribution would I choose for that?

If I have had 20 clicks on A and 4 clicks on B, how can I model that? So far I have figured out that a binomial or Poisson distribution might make sense here:

  • purchase_rate ~ poisson (?)
  • (purchase_rate | group A) ~ poisson (estimate the purchase rate only for the group A?)

But what do I do next to actually calculate purchase_rate | group A? How do I plug the two distributions together so that they make sense for group A (or any other group)?

Do I have to fit a model first? I have data that I could use to "train" a model:

  • Ad A: 352 clicks, 5 purchases
  • Ad B: 15 clicks, 0 purchases
  • Ad C: 3519 clicks, 130 purchases

I'm looking for a way to estimate the purchase probability for any one of the groups. If a group has only a couple of data points, I essentially want to fall back to the global average. I know a bit about Bayesian statistics and have read a lot of PDFs of people describing how they model things using Bayesian inference, conjugate priors, and so on. I think there is a way to do this properly, but I can't figure out how to model it correctly.

I would be super happy about hints that help me formulate my problem in a Bayesian way. That would help a lot with finding examples online that I could use to actually implement this.

Update:

Thanks so much for responding. I'm beginning to understand more and more little bits about my problem. Thank you! Let me ask a few questions to see if I understand the problem a bit better now:

So I assume the conversion rates are Beta-distributed, and a Beta distribution has two parameters, $a$ and $b$.

The $\frac{1}{2}, \frac{1}{2}$ parameters (i.e., the Jeffreys prior) are hyperparameters, so they are parameters of the prior? So in the end I set the number of conversions and the number of clicks as the parameters of my Beta distribution?

At some point I want to compare different ads, so I would compute $P(\mathrm{conversion} \mid \mathrm{ad}=X) = \frac{P(\mathrm{ad}=X \mid \mathrm{conversion}) \, P(\mathrm{conversion})}{P(\mathrm{ad}=X)}$. How do I compute each part of that formula?

  • I think $P(\mathrm{ad}=X \mid \mathrm{conversion})$ is called the likelihood, or the "mode" of the Beta distribution. So that's $\frac{\alpha - 1}{\alpha + \beta - 2}$, with $\alpha$ and $\beta$ being the parameters of my distribution. But the specific $\alpha$ and $\beta$ here are the parameters of the distribution just for ad $X$, right? In that case, is it just the number of clicks and conversions this ad has seen? Or is it how many clicks/conversions all ads have seen?

  • Then I multiply by the prior, $P(\mathrm{conversion})$, which in my case is just the Jeffreys prior, which is non-informative. Will the prior stay the same as I get more data?

  • I divide by $P(\mathrm{ad}=X)$, which is the marginal likelihood, so I count how often this ad has been clicked?

In using Jeffreys' prior, I'm assuming that I'm starting at zero and don't know anything about my data. That prior is called "non-informative". As I continue learning about my data, do I update the prior?

As clicks and conversions come in, I have read that I have to "update" my distribution. Does this mean that the parameters of my distribution change, or that the prior changes? When I get a click on ad X, do I update more than one distribution? More than one prior?
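To make my current understanding concrete, here is a small sketch of the per-ad conjugate updating I have in mind, assuming a Jeffreys $\mathrm{Beta}(\tfrac{1}{2},\tfrac{1}{2})$ prior and using my data from above. Note that this treats every ad independently, so it does not yet do the "fall back to the global average" part:

```python
import numpy as np
from scipy import stats

# Jeffreys prior hyperparameters for the conversion probability
a0, b0 = 0.5, 0.5

# clicks and purchases per ad (ads A, B, C from above)
clicks = np.array([352, 15, 3519])
purchases = np.array([5, 0, 130])

# conjugate Beta-Binomial update: posterior is Beta(a0 + k, b0 + n - k)
post_a = a0 + purchases
post_b = b0 + (clicks - purchases)

for ad, a, b in zip("ABC", post_a, post_b):
    post = stats.beta(a, b)
    print(f"Ad {ad}: posterior mean {post.mean():.4f}, "
          f"95% interval ({post.ppf(0.025):.4f}, {post.ppf(0.975):.4f})")
```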

Best Answer

As you intuited, a very general way of addressing your question is to construct a hierarchical (multilevel) Bayesian model. The model has three parts, as illustrated below.

Model

[Figure: hierarchical Bayesian model for ad conversion rates]

  1. At the population level, we model the conversion probability in the population of ads from which your particular set of tested ads is sampled. One could fix the population parameters and use them as a prior for the second level, as was noted before by Neil. Alternatively, we could place a prior on the population parameters themselves, which provides the additional advantage that we can now express our uncertainty about the population parameters in light of the data. Let's follow this route and place a prior $\mathcal{N}(\mu \mid \mu_0, \eta_0)$ on the population mean $\mu$ and $\textrm{Ga}(\lambda \mid a_0, b_0)$ on the population precision (i.e., inverse variance). A diffuse prior can be obtained using $\mu_0 = 0, \eta_0 = 0.1, a_0 = 1, b_0 = 1$, which ensures our posterior inferences will be dominated by the data.

  2. At the level of individual ads, we can model the conversion probability $\pi_j$ of a given ad $j$ as logit-normally distributed. Thus, for each ad $j$, the logit conversion probability $\rho_j := \textrm{logit}(\pi_j)$ is modelled as $\mathcal{N}(\rho_j \mid \mu,\lambda)$.

  3. Finally, at the level of observed data, we model the number of conversions $k_j$ for ad $j$ as $\textrm{Bin}(k_j \mid \sigma(\rho_j), n_j)$, where $\sigma(\rho_j)$ uses the sigmoid transform to translate a logit rate back into a probability, and where $n_j$ is the number of clicks on ad $j$. (A small simulation sketch of this three-level structure follows after this list.)
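To make the generative structure concrete, here is a minimal simulation sketch of the three levels. It is purely illustrative: the variable names are mine, and I read $\eta_0$ here as a prior precision, matching the precision parametrisation used for $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Level 1: population parameters, drawn here from the diffuse priors in the text
mu0, eta0, a0, b0 = 0.0, 0.1, 1.0, 1.0
lam = rng.gamma(shape=a0, scale=1.0 / b0)      # population precision ~ Ga(a0, b0)
mu = rng.normal(mu0, 1.0 / np.sqrt(eta0))      # population mean ~ N(mu0, precision eta0)

# Level 2: ad-specific logit conversion rates rho_j ~ N(mu, precision lam)
n_ads = 3
rho = rng.normal(mu, 1.0 / np.sqrt(lam), size=n_ads)
pi = sigmoid(rho)                              # per-ad conversion probabilities

# Level 3: observed conversions k_j ~ Bin(n_j, sigmoid(rho_j))
n_clicks = np.array([352, 15, 3519])
k = rng.binomial(n_clicks, pi)

print("simulated conversion probabilities:", np.round(pi, 4))
print("simulated conversions:", k)
```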

Data

As an example, let's take the data you posted in your original question,

Ad A: 352 clicks, 5 purchases

Ad B: 15 clicks, 0 purchases

Ad C: 3519 clicks, 130 purchases

which we translate into: $n_1 = 352, k_1 = 5, n_2 = 15, k_2 = 0, \ldots$

Inference

Inverting this model means obtaining posterior distributions for our model parameters. Here, I used a variational Bayes approach to model inversion, which is computationally more efficient than stochastic sampling schemes such as MCMC. The results are plotted further below.
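My variational implementation is not reproduced here, but as a rough sketch of one way to invert the same model with off-the-shelf tooling, one could use PyMC and its ADVI routine. This substitution, and the reading of $\eta_0$ as a precision, are assumptions on my part rather than the original implementation:

```python
import numpy as np
import pymc as pm

n_clicks = np.array([352, 15, 3519])   # n_j: clicks per ad
k_conv = np.array([5, 0, 130])         # k_j: conversions per ad

with pm.Model() as model:
    # Level 1: priors on the population mean and precision of the logit rates
    mu = pm.Normal("mu", mu=0.0, sigma=1.0 / np.sqrt(0.1))
    lam = pm.Gamma("lam", alpha=1.0, beta=1.0)

    # Level 2: ad-specific logit conversion rates
    rho = pm.Normal("rho", mu=mu, sigma=1.0 / pm.math.sqrt(lam), shape=len(n_clicks))
    pi = pm.Deterministic("pi", pm.math.sigmoid(rho))

    # Level 3: observed conversions
    pm.Binomial("k", n=n_clicks, p=pi, observed=k_conv)

    # Variational approximation (ADVI), then draw samples from the fitted approximation
    approx = pm.fit(n=30_000, method="advi")
    idata = approx.sample(2_000)
```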

[Figure: data and resulting posteriors]

The figure shows three panels. (a) A simple visualization of the example data you provided. The grey bars represent the number of clicks, the black bars show the number of conversions. (b) The resulting posterior distribution over the population mean conversion rate. As we observe more data, this will become more and more precise. (c) Central 95% posterior probability intervals (or credible intervals) of the ad-specific posterior conversion rates.

The last panel illustrates two key features of a Bayesian approach to hierarchical modelling. First, the precision of the posteriors reflects the number of underlying data points. For example, we have relatively many data points for ad C; thus, its posterior is much more precise than the posteriors of the other ads.

Second, ad-specific inferences are informed by knowledge about the population. In other words, ad-specific posteriors are based on data from the entire group, an effect known as shrinkage toward the population. For example, the posterior mode (black circle) of ad A is much higher than its empirical conversion rate (blue). This is because all other ads have higher posterior modes, and thus we can obtain a better estimate of the ground truth by informing our ad-specific estimates with the group mean. The less data we have about a particular ad, the more its posterior will be influenced by data from the other ads.
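To see the shrinkage numerically, one can compare the raw empirical rates with the posterior means of $\pi_j$. Continuing the PyMC sketch above (and reusing its `idata`), that comparison might look like this:

```python
import numpy as np

n_clicks = np.array([352, 15, 3519])
k_conv = np.array([5, 0, 130])

empirical = k_conv / n_clicks                                          # raw per-ad rates
post_mean = idata.posterior["pi"].mean(dim=("chain", "draw")).values   # from the sketch above

for ad, emp, pmn in zip("ABC", empirical, post_mean):
    print(f"Ad {ad}: empirical {emp:.4f} -> posterior mean {pmn:.4f}")
# Ads with few clicks (e.g. B) are pulled most strongly toward the population mean.
```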

All of the ideas you described in your original question are accomplished naturally in the above model, illustrating the practical utility of a fully Bayesian setting.
