Bayesian Reinforcement Learning – Finding the Best Arm in a Multi-Armed Bandit Without Exploitation

algorithms, bayesian, beta-distribution, multiarmed-bandit, reinforcement-learning

I have a problem similar to the 'Bernoulli bandit' problem in the exploration-exploitation paradigm, but without the exploitation element.

In particular, I have many levers that I can pull and each pull either succeeds or fails a proportion of the time that is fixed but unknown, and different for each lever. I want to find the lever that succeeds the most often. What's a good strategy to use if I want to minimise the expected number of pulls until I'm 99% sure I've found the one with the highest proportion of successes?
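For concreteness, here is a minimal sketch of the setup in Python (the success probabilities in `true_p` are made up for illustration). The "99% sure" criterion can be made precise in Bayesian terms: give each lever a Beta posterior and estimate, by Monte Carlo, the posterior probability that each lever is the best one; stop once some lever's probability exceeds 0.99.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: true success probabilities, unknown to the learner.
true_p = np.array([0.30, 0.50, 0.55])
n_arms = len(true_p)

def pull(arm):
    """One pull of a lever: returns True on success, False on failure."""
    return rng.random() < true_p[arm]

def prob_best(successes, failures, n_samples=10_000):
    """Monte Carlo estimate of P(each arm is the best) under independent
    Beta(1 + successes, 1 + failures) posteriors (uniform priors)."""
    draws = rng.beta(1 + successes, 1 + failures,
                     size=(n_samples, len(successes)))
    return np.bincount(draws.argmax(axis=1), minlength=len(successes)) / n_samples
```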

This differs from the usual multi-armed bandit problem in that during this testing I'm not rewarded for getting successes rather than failures. So I don't particularly favour strategies which get more successes. In particular I tried Thompson Sampling and I found that once it had found a good lever it spent too much time pulling that lever and not enough time trying to rule out the second-best candidates.
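To make the failure mode concrete, here is a minimal sketch of vanilla Thompson Sampling with Beta posteriors (reusing `rng` and `pull` from the sketch above). Because it always pulls the lever with the highest posterior draw, once one lever's posterior dominates it receives almost all the pulls, even though those pulls contribute little to separating the remaining levers:

```python
def thompson_step(successes, failures):
    """Vanilla Thompson Sampling: draw one sample from each Beta posterior
    and pull the lever whose sample is largest (updates counts in place)."""
    draws = rng.beta(1 + successes, 1 + failures)
    arm = int(draws.argmax())
    if pull(arm):
        successes[arm] += 1
    else:
        failures[arm] += 1
    return arm
```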

What would be a good algorithm for this problem? Can we say what would be optimal from a Bayesian perspective?

Best Answer


If you are interested in minimizing the number of pulls needed to identify the best arm, the setting you want is Best Arm Identification (BAI), in its fixed-confidence variant. In this setting there is no notion of regret; you simply aim to identify the best arm as quickly as possible (say, within $\tau$ pulls) with probability at least $1-\delta$, for $\delta\in(0,1)$.
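Schematically, a fixed-confidence BAI algorithm consists of a sampling rule (which arm to pull next), a stopping rule (when are you $1-\delta$ sure), and a recommendation rule (which arm to output). A minimal sketch of this loop, with the first two rules left as pluggable functions and reusing the `pull` helper from the question's sketch:

```python
def best_arm_identification(sample_arm, should_stop, n_arms):
    """Generic fixed-confidence BAI loop: `sample_arm` is the sampling
    rule, `should_stop` the stopping rule; the recommendation is the
    empirically best arm. Returns (recommended arm, number of pulls tau)."""
    successes = np.zeros(n_arms)
    failures = np.zeros(n_arms)
    t = 0
    while not should_stop(successes, failures, t):
        arm = sample_arm(successes, failures, t)
        if pull(arm):
            successes[arm] += 1
        else:
            failures[arm] += 1
        t += 1
    means = successes / np.maximum(successes + failures, 1)
    return int(means.argmax()), t
```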

Many algorithms have been proposed to solve this problem. A provably optimal one (asymptotically, as $\delta \to 0$) is the Track-and-Stop algorithm of Garivier and Kaufmann. An algorithm $\mathcal{A}$ is optimal in the sense that no other algorithm can achieve a smaller expected sample complexity on a given class of problems (e.g. bandits with stochastic Gaussian or Bernoulli reward distributions).
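To give a flavour of what this looks like: the stopping side of Track-and-Stop is a generalized likelihood ratio ("Chernoff") test comparing the empirical leader against every challenger. A simplified sketch for Bernoulli arms follows; the threshold used here is a common stand-in, and the properly calibrated one is in the paper:

```python
def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def chernoff_stop(successes, failures, t, delta=0.01):
    """Stop when the GLR statistic separating the empirical leader from
    its closest challenger exceeds a threshold (simplified stand-in)."""
    n = successes + failures
    if (n == 0).any():
        return False  # pull every arm at least once first
    mu = successes / n
    leader = int(mu.argmax())
    z = np.inf
    for b in range(len(mu)):
        if b == leader:
            continue
        # Weighted mean: the closest alternative where arm b beats the leader.
        m = (n[leader] * mu[leader] + n[b] * mu[b]) / (n[leader] + n[b])
        z = min(z, n[leader] * kl_bern(mu[leader], m) + n[b] * kl_bern(mu[b], m))
    return z > np.log((1 + np.log(max(t, 2))) / delta)
```

This plugs into the generic loop above as `should_stop`; the sampling side of Track-and-Stop (tracking the optimal pull proportions) is spelled out in the paper.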

There are plenty of other algorithms in the BAI literature (see Audibert et al. or Jamieson et al.). Thompson Sampling can also work well for best arm identification, but you need to tweak the algorithm a bit, choose the prior carefully, and update the posterior properly. This paper is a good reference for the pure exploration problem in the Bayesian setting.
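The usual tweak is "top-two" sampling: instead of always pulling the posterior-best arm, you re-sample the posterior with some probability so that the best challenger also gets pulled, which directly addresses the behaviour you observed. A minimal sketch of Top-Two Thompson Sampling (the mixing parameter $\beta = 1/2$ is a common default; `successes`/`failures` are the Beta posterior counts as above):

```python
def top_two_thompson_step(successes, failures, beta=0.5, max_resamples=100):
    """Top-Two Thompson Sampling: with probability beta pull the
    posterior-best arm; otherwise re-sample the posterior until a
    different arm comes out on top, and pull that challenger."""
    leader = int(rng.beta(1 + successes, 1 + failures).argmax())
    arm = leader
    if rng.random() > beta:
        for _ in range(max_resamples):  # cap in case one arm dominates
            challenger = int(rng.beta(1 + successes, 1 + failures).argmax())
            if challenger != leader:
                arm = challenger
                break
    if pull(arm):
        successes[arm] += 1
    else:
        failures[arm] += 1
    return arm
```

Paired with a Bayesian stopping rule (e.g. stop when the Monte Carlo estimate from `prob_best` in the question's sketch exceeds $1-\delta$), this gives a practical Bayesian procedure for your problem.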

To find more algorithms you can just google "Best Arm Identification" or "Pure Exploration".