Solved – Why is this regret a good choice for a multi-armed bandit

machine learning, multiarmed-bandit, sequential-analysis

The regret in a multi-armed bandit model is given by

$$\underset{j}{\max}\sum_{t=1}^{T}x_j(t) -G_{A}$$

where $$G_A=\sum_{t=1}^{T}x_{i_t}(t)$$ is the total reward achieved by the learner, based on the action $i_t$ taken at each time step $t$, i.e. $i_{t}\in \{1,2,\dots,K\}$, and $x_j(t)$ is the reward associated with action $j$ at time $t$, with $K$ being the total number of actions available.
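For concreteness, this is the quantity I have in mind; a small simulation (the Bernoulli rewards, the arm means and the random action sequence below are purely my own illustration) would compute it as

```python
import numpy as np

rng = np.random.default_rng(0)

T, K = 1000, 3                       # horizon and number of arms (illustrative values)
means = np.array([0.2, 0.5, 0.7])    # assumed Bernoulli means, one per arm

# x[t, j] plays the role of x_j(t): the reward arm j would pay at time t
x = rng.binomial(1, means, size=(T, K)).astype(float)

# some action sequence i_t (here just uniformly random choices)
actions = rng.integers(0, K, size=T)

G_A = x[np.arange(T), actions].sum()     # total reward collected by the learner
best_in_hindsight = x.sum(axis=0).max()  # max_j sum_t x_j(t)
regret = best_in_hindsight - G_A
print(regret)
```

so the first term is the total reward of the single best arm in hindsight, and $G_A$ is what the learner actually collected.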

In a stochastic multi-armed bandit setting, where each arm is modeled by a population distribution:

a) What is the connection between the sequential decisions taken by a chosen learner and the minimization of the expectation of the above regret?

I ask this because I do not see the expectation of the best arm being accounted for directly in the given form of the regret. Is there instead a bound that connects this regret to the population mean parameters?

I looked at the expected regret, $$\mathbb{E}\left[\underset{j}{\max}\sum_{t=1}^{T}x_j(t)\right]-\mathbb{E}\left[G_A\right],$$ and want to understand the connection between the steps of any chosen multi-armed-bandit algorithm and the minimization of this expected regret. Is it minimized directly, or via the minimization of, say, an indirect upper bound?

Best Answer

As Michelle said, the first term is the utility of the omniscient learner, and the second term is the utility of your agent. Your goal is to devise a policy---a rule to select an action $i$ at time $t$---to minimize the difference, which we call the regret.
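In the stochastic setting you describe, the cleanest way to bring the population means into the picture is through the expected (pseudo-)regret; the notation $\mu_j$, $\Delta_j$ and $N_j(T)$ below is mine, not from your post:

$$\bar{R}_T \;=\; T\mu^{*}-\mathbb{E}\left[\sum_{t=1}^{T}x_{i_t}(t)\right] \;=\; \sum_{j=1}^{K}\Delta_j\,\mathbb{E}\left[N_j(T)\right], \qquad \Delta_j=\mu^{*}-\mu_j,$$

where $\mu_j$ is the mean of arm $j$'s distribution, $\mu^{*}=\max_j \mu_j$, and $N_j(T)$ is the number of times arm $j$ has been pulled up to time $T$. The pseudo-regret replaces the random best arm in hindsight by the best arm in expectation, and it is the quantity for which regret bounds are usually stated: keeping the expected regret small amounts to keeping the expected number of pulls of suboptimal arms small.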

The crux of the problem is that you do not know which arm is optimal, and at each round you only observe the reward of the arm you actually play; that is where the expectations come in. So the problem reduces to one of estimating the best arm at each moment from the information available so far.
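As a minimal sketch of what such a policy looks like in practice, here is an illustrative UCB1-style implementation on Bernoulli arms (the function name, arm means and horizon are my own choices). The policy plays the arm with the highest upper confidence bound on its estimated mean; its expected regret is then controlled through a provable upper bound (logarithmic in $T$ for UCB1), rather than by minimizing the expected regret directly.

```python
import numpy as np

def ucb1(means, T, seed=0):
    """Run UCB1 on Bernoulli arms with the given means for T rounds."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)   # N_j(t): number of pulls of each arm so far
    sums = np.zeros(K)     # cumulative observed reward of each arm
    total = 0.0

    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                               # play each arm once to initialise
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))                 # optimism in the face of uncertainty
        reward = rng.binomial(1, means[arm])          # only this arm's reward is observed
        counts[arm] += 1
        sums[arm] += reward
        total += reward

    return total, counts

means = [0.2, 0.5, 0.7]
T = 10_000
reward, counts = ucb1(means, T)
pseudo_regret = T * max(means) - sum(m * c for m, c in zip(means, counts))
print(pseudo_regret, counts)
```

The pseudo-regret printed at the end is just the decomposition above evaluated on the realized pull counts, which is exactly the quantity the algorithm's upper bound controls.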