Solved – How is the Upper Confidence Bound derived

confidence-interval, machine-learning, multiarmed-bandit, reinforcement-learning

I came across the formula for obtaining the upper confidence bounds on the k-armed bandit problem:

$$c\sqrt{\frac{\ln N_i}{n_i}}$$

where $n_i$ is the number of samples we have for this particular bandit and $N_i$ is the total number of samples across all bandits. The same formula is used in Monte Carlo Tree Search as well to obtain the upper confidence bound.

I understand very clearly what an upper confidence bound is, but I don't understand where this formula comes from. I have looked in several places online but could not find a clear explanation of how it is derived. Can someone please explain? Please assume I don't have a great background in statistics.

Best Answer

What you have there is commonly called the exploration term. The upper confidence bound is the empirical mean plus this exploration term.
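As a concrete illustration, here is a minimal Python sketch of that selection rule (the function and variable names are mine, not from any particular library, and rewards are assumed to lie in $[0, 1]$):

```python
import math

def ucb1_select(counts, means, c=math.sqrt(2)):
    """Pick an arm by UCB1: empirical mean plus the exploration term.

    counts[i] -- number of pulls of arm i so far (n_i in the question)
    means[i]  -- empirical mean reward of arm i
    c         -- exploration constant; sqrt(2) is the classic UCB1 choice
                 for rewards bounded in [0, 1]
    """
    total = sum(counts)  # N_i in the question's notation: pulls over all arms
    # Pull every arm once first, since the formula is undefined for n_i = 0.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    # Upper confidence bound = empirical mean + c * sqrt(ln(total) / n_i).
    scores = [m + c * math.sqrt(math.log(total) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

At each round you would call this with the current statistics, pull the returned arm, and then update `counts` and `means` with the observed reward.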

Let's consider each term separately:

$c$ is a constant which lets the user set the exploration/exploitation trade-off. For theoretical results it is often optimized for the problem at hand (e.g. k-armed bandits with Gaussian priors).

$\sqrt{1/n_i}$ is proportional to the posterior standard deviation after $n_i$ samples of action $i$. Essentially, this says that the more often you pull an arm, the less uncertain you are about its value.

$\sqrt{\ln N_i}$ ensures that you don't stop exploring too early. As $N_i$ grows, the exploration bonuses of well-sampled arms shrink, so without this term an arm whose empirical mean happened to come out too low could be abandoned forever. Because $\ln N_i$ keeps growing (slowly) with the total number of pulls, every arm's upper confidence bound eventually rises enough to warrant another pull. Most of the technical math is to show that $\sqrt{\ln N_i}$ is just enough (but not too much) compensation.
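To give a rough sense of where this particular shape comes from (a sketch of the standard Chernoff–Hoeffding argument, not a full proof, assuming rewards bounded in $[0, 1]$): Hoeffding's inequality bounds the probability that the true mean $\mu_i$ exceeds the empirical mean $\bar{x}_i$ by more than $\varepsilon$ after $n_i$ pulls,

$$P\left(\mu_i \ge \bar{x}_i + \varepsilon\right) \le e^{-2 n_i \varepsilon^2}.$$

If we require this failure probability to shrink as the total number of pulls grows, say to be at most $N_i^{-4}$ (one standard choice), and solve for $\varepsilon$, we get

$$e^{-2 n_i \varepsilon^2} = N_i^{-4} \quad\Longrightarrow\quad \varepsilon = \sqrt{\frac{2 \ln N_i}{n_i}},$$

which is exactly the exploration term above with $c = \sqrt{2}$.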

For a more technical description, the paper by Auer et al. is a good starting point.

Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning 47.2-3 (2002): 235-256.
