Multiarmed Bandit – Multi-Armed Bandit for Continuous Rewards: Extended Question and In-depth Analysis

beta-binomial distribution, gaussian process, multiarmed-bandit, normal distribution, reinforcement learning

This question is an extension to "A continuous generalization of the binary bandit".

The Multi-Armed Bandit (MAB) Problem in general is described here: https://en.wikipedia.org/wiki/Multi-armed_bandit

Briefly, you have several slot machines (bandits), each of which gives a reward with some probability. Since you don't know the reward probabilities beforehand, you try to find the best bandit by trying out arms (exploration) while playing the best arms found so far (exploitation).

If you knew the best arm, you could always pull it and collect the maximum possible reward. But since you don't, you have to keep exploring; the difference between the best possible reward and the reward you actually gained is called the regret, and the goal is to minimize it. This is the exploration-exploitation dilemma (or tradeoff).
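In symbols (the notation here is added for clarity and is not from the original question): if the arm means are $\mu_1, \dots, \mu_K$ with $\mu^* = \max_i \mu_i$, and $a_t$ is the arm pulled at step $t$, the expected regret after $T$ pulls is commonly written as

$$
R_T = T\,\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{a_t}\right]
$$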

For the binomial Multi-Armed Bandit problem, several algorithms to explore and exploit exist: $\epsilon$-Greedy, Softmax, UCB1, Thompson Sampling, etc. Most of the available resources only treat the bandit problem with Bernoulli rewards (0 or 1).

For example, the $\epsilon$-Greedy algorithm is rather simple: with probability $\epsilon$ you explore (i.e. you pull a random arm), and with probability $1-\epsilon$ you exploit, i.e. you play the arm with the highest average reward so far.
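A minimal sketch of this selection rule in Python (the function and variable names are illustrative, not from any particular library):

```python
import random

def epsilon_greedy_choice(mean_rewards, epsilon=0.1):
    """Return an arm index: explore with probability epsilon, otherwise exploit.

    mean_rewards: list of empirical mean rewards per arm (illustrative name).
    """
    if random.random() < epsilon:
        return random.randrange(len(mean_rewards))                       # explore: random arm
    return max(range(len(mean_rewards)), key=lambda i: mean_rewards[i])  # exploit: best arm so far
```

After each pull, the chosen arm's empirical mean would be updated, e.g. with a running average per arm.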

Empirically, Thompson Sampling (a.k.a. the Bayesian Bandit) has shown good performance at minimizing regret for binomial bandits. Thompson Sampling is what is described in the original question (see above): you keep a Beta distribution for each arm, update it according to the trials and successes seen so far, draw a sample from each distribution, and play the arm whose sample is the maximum.

The Beta distribution is the conjugate prior to the binomial distribution, so the posterior is again a Beta distribution and the Bayesian update is rather simple. As we know, the general Bayes rule is

$$
P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}
$$

If you do the math and look at the Beta distribution, you'll find that the posterior of a $Beta(\alpha_{prior},\beta_{prior})$ prior is just $Beta(\alpha_{prior} + \text{successes},\ \beta_{prior} + \text{trials} - \text{successes})$, as stated in the question above. Usually, the prior used in this case is the non-informative Jeffreys prior, which for the Beta distribution is $\alpha=\beta=\dfrac{1}{2}$.
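Putting the last two paragraphs together, a minimal Thompson Sampling loop for Bernoulli rewards might look like the sketch below (the simulated reward probabilities and variable names are made up for illustration; the Jeffreys prior $\alpha = \beta = \tfrac{1}{2}$ is used as described above):

```python
import random

# Hypothetical true reward probabilities of three Bernoulli arms (simulation only).
true_probs = [0.3, 0.5, 0.7]
K = len(true_probs)

# One Beta(1/2, 1/2) Jeffreys prior per arm.
alphas = [0.5] * K
betas = [0.5] * K

for t in range(10_000):
    # Draw one sample from each arm's Beta posterior and play the arm with the largest sample.
    samples = [random.betavariate(alphas[i], betas[i]) for i in range(K)]
    arm = max(range(K), key=lambda i: samples[i])

    # Observe a Bernoulli reward and apply the conjugate update:
    # a success adds 1 to alpha, a failure adds 1 to beta.
    reward = 1 if random.random() < true_probs[arm] else 0
    alphas[arm] += reward
    betas[arm] += 1 - reward
```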

This question now looks for a way to extend the MAB problem to continuous rewards. One attempt could be to extend the Thompson Sampling approach, which raises the following points:

  • Choose a prior continuous distribution.
  • How to do a Bayesian update to the continuous distribution, given the prior? Estimate mean and variance for a normal distribution? In batches, or data point by data point? Or update the normal as described at http://www.mhnederlof.nl/bayesnormalupdate.html ?
  • What does a Gaussian Process mean in the context of continuous bandits? Checking the papers, a lot of approaches seem to utilize GPs – why is that, and what does it mean in plain English?
  • With high-variance data, the normal distributions tend to overlap a lot – thus, sampling from these distributions leads to a high probability of exploring rather than exploiting the best arm, which results in high regret. How can this be avoided / how can the regret be minimized even with high-variance data?

Or are there other algorithms for the continuous MAB that could be suggested?

Best Answer

  • Choose a prior continuous distribution

Just choose the Jeffreys prior for the normal distribution.
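To spell that out (this detail is an addition, not part of the original answer): for a normal likelihood with unknown mean $\mu$ and standard deviation $\sigma$, the Jeffreys prior is the improper density

$$
p(\mu, \sigma) \propto \dfrac{1}{\sigma^{2}}
$$

and the closely related independence-Jeffreys (reference) prior $p(\mu, \sigma) \propto 1/\sigma$ is also commonly used in practice.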

  • How to do a Bayesian update to the continuous distribution, given the prior?

Yes, just a regular Bayesian update: the posterior is proportional to the pointwise product of the prior density and the likelihood.
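In the simplest conjugate case, a normal prior on the mean with known observation variance, that product of densities collapses to closed-form formulas. A minimal sketch with illustrative names, assuming the observation variance $\sigma^2$ is known:

```python
def normal_known_variance_update(mu0, tau0_sq, data, sigma_sq):
    """Conjugate update for the mean of a normal likelihood with known variance.

    Prior on the mean: N(mu0, tau0_sq); likelihood: N(mu, sigma_sq) with sigma_sq known.
    Returns the posterior mean and variance of mu (standard conjugate formulas).
    """
    n = len(data)
    xbar = sum(data) / n
    post_precision = 1.0 / tau0_sq + n / sigma_sq   # precisions add
    post_var = 1.0 / post_precision
    post_mean = post_var * (mu0 / tau0_sq + n * xbar / sigma_sq)
    return post_mean, post_var
```

The same posterior is obtained whether you feed in the data in one batch or call the update point by point, feeding each posterior back in as the next prior; with unknown variance you would use a Normal-Inverse-Gamma (or Normal-Gamma) prior instead.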

  • What does a Gaussian Process mean in the context of continuous bandits? Checking the papers, a lot of approaches seem to utilize GPs – why is that, and what does it mean in plain English?

That entails everything that the normal assumption entails. This could be a whole other question.

  • With high-variance data, the normal distributions tend to overlap a lot – thus, sampling from these distributions leads to a high probability of exploring rather than exploiting the best arm, which results in high regret. How can this be avoided / how can the regret be minimized even with high-variance data?

As you collect data, the posterior normal distributions will have smaller and smaller variances, which makes your decisions increasingly deterministic. The variance of the data doesn't matter in the sense that the ideal action is the one with the highest expected return, regardless of the variance of the return.
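To illustrate that shrinkage, here is a rough Gaussian Thompson Sampling simulation (all names and the simulated reward distributions are made up; for simplicity the observation variance is treated as known and a flat prior on each arm's mean is assumed, so each posterior is approximately $N(\bar{x}_i, \sigma^2/n_i)$):

```python
import math
import random

# Hypothetical continuous-reward arms: large shared noise, small gaps between means.
true_means, sigma = [1.0, 1.2, 1.5], 5.0
K = len(true_means)
counts = [0] * K     # number of pulls per arm
sums = [0.0] * K     # summed rewards per arm

for t in range(20_000):
    samples = []
    for i in range(K):
        if counts[i] == 0:
            samples.append(float("inf"))            # force at least one pull of every arm
        else:
            mean = sums[i] / counts[i]
            std = sigma / math.sqrt(counts[i])      # posterior std shrinks like 1/sqrt(n)
            samples.append(random.gauss(mean, std))
    arm = max(range(K), key=lambda i: samples[i])

    reward = random.gauss(true_means[arm], sigma)   # noisy continuous reward
    counts[arm] += 1
    sums[arm] += reward

# Even though sigma is much larger than the gaps between the means, the posteriors
# concentrate over time and the best arm ends up being pulled most often.
```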
