Solved – Linear Regret for epsilon-greedy algorithm in Multi-Armed Bandit problem

machine-learning, multiarmed-bandit

I am reading about the $\epsilon$-greedy algorithm for the Multi-Armed Bandit (or $K$-armed bandit) problem, as described here: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies.

For the sake of completeness, I am stating the $\epsilon$-greedy algorithm briefly here.

  • The algorithm maintains an estimate $\hat\mu_i$ of the expected reward of the $i^{th}$ arm. Initially, each arm is pulled once to initialize $\hat\mu_i$.
  • Then, for each time step $t$:
    • $k = \arg\max_{1 \leq i \leq K}\hat\mu_i$
    • with probability $1-\epsilon$ pull arm $k$, and with probability $\epsilon$ pull one of the other arms uniformly at random (a minimal sketch of this loop follows the list).
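A minimal Python sketch of this loop, under my own assumptions (Bernoulli rewards, and the function and variable names are hypothetical, not from the post):

```python
import numpy as np

def epsilon_greedy(true_means, T, epsilon, seed=0):
    """Epsilon-greedy on Bernoulli arms with the given true means; returns per-step rewards."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts = np.zeros(K)        # number of pulls per arm
    estimates = np.zeros(K)     # running mean reward per arm (mu_hat)
    rewards = np.zeros(T)

    for t in range(T):
        best = int(np.argmax(estimates))
        if t < K:
            arm = t                                    # pull each arm once to initialize
        elif rng.random() < 1 - epsilon:
            arm = best                                 # exploit the current best estimate
        else:
            others = [i for i in range(K) if i != best]
            arm = int(rng.choice(others))              # explore among the other arms
        reward = float(rng.random() < true_means[arm])               # Bernoulli reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]    # incremental mean update
        rewards[t] = reward
    return rewards
```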

Note that the regret at time $T$ in this case is quantified as:
$\text{Regret}_T = T\mu^* - \sum_{t=1}^{T} x_{i(t)}$, where $\mu^* = \max_{k\in\{1,\dots,K\}}\mu_k$ and $x_{i(t)}$ is the reward received at time $t$.
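This is not stated in the original post, but a standard equivalent form of the expected regret, which the answer below implicitly relies on, decomposes it over the per-arm gaps:

$$\mathbb{E}[\text{Regret}_T] = T\mu^* - \sum_{t=1}^{T}\mathbb{E}\!\left[\mu_{i(t)}\right] = \sum_{k=1}^{K}\Delta_k\,\mathbb{E}\!\left[T_k(T)\right], \qquad \Delta_k = \mu^* - \mu_k,$$

where $T_k(T)$ is the number of times arm $k$ has been pulled up to time $T$.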

Now, it seems that the expected cumulative regret can't be bounded by $O(\log T)$ (as a function of the horizon $T$); instead, this algorithm has linear regret.

It is not clear to me how one can prove this claim. Any help elucidating this would be great.

Best Answer

If $\epsilon$ is a constant, then this has linear regret. Suppose that the initial estimate is perfect. Then you pull the "best" arm with probability $1-\epsilon$ and pull an imperfect arm with probability $\epsilon$; each exploratory pull costs the suboptimality gap of the chosen arm in expectation, giving expected regret proportional to $\epsilon T = \Theta(T)$.
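A quick numerical check of this argument, reusing the hypothetical `epsilon_greedy` sketch above (the arm means and horizon here are made up for illustration):

```python
import numpy as np

true_means = np.array([0.9, 0.5, 0.4])   # assumed arm means; arm 0 is optimal
T, epsilon = 100_000, 0.1
rewards = epsilon_greedy(true_means, T, epsilon, seed=1)
regret = T * true_means.max() - rewards.sum()
print(f"regret = {regret:.0f}, regret / T = {regret / T:.3f}")
# regret / T hovers around a positive constant (roughly epsilon times the mean gap),
# i.e. cumulative regret grows linearly in T.
```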

However, the parameter $\epsilon$ is typically set to be a decreasing function of the time step $t$. The trick is to decrease it fast enough that the regret is close to the optimal $\sqrt{T}$ (or $\frac{\log T}{\Delta}$ in the distribution-dependent case, where $\Delta$ is the suboptimality gap), but slowly enough that the estimate of which arm is best converges to the optimum. For details on how to choose $\epsilon$, see Auer et al., who set it proportional to $1/t$.
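For concreteness, a sketch of the $\epsilon_t$-greedy schedule from Auer et al. ("Finite-time Analysis of the Multiarmed Bandit Problem"); the default values of `c` and `d` below are placeholders of my own, not the paper's:

```python
def epsilon_t(t, K, c=5.0, d=0.1):
    # Auer et al.'s schedule: epsilon_t = min(1, c*K / (d^2 * t)),
    # where 0 < d <= min_k Delta_k is a lower bound on the smallest gap and c > 0
    # is a tuning constant; the defaults here are illustrative placeholders.
    return min(1.0, c * K / (d ** 2 * t))
```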
