Solved – Purpose of Dirichlet noise in the AlphaZero paper

In DeepMind's AlphaGo Zero and AlphaZero papers, they describe adding Dirichlet noise to the prior probabilities of actions from the root node (board state) in Monte Carlo Tree Search:

Additional exploration is achieved by adding Dirichlet noise to the prior probabilities in the root node $s_0$, specifically $P(s, a) = (1−\varepsilon)p_a+ \varepsilon \eta_a$, where $\eta \sim \text{Dir}(0.03)$ and $\varepsilon = 0.25$; this noise ensures that all moves may be tried, but the search may still overrule bad moves.

(AlphaGo Zero)

And:

Dirichlet noise $\text{Dir}(\alpha)$ was added to the prior probabilities in the root node; this was scaled in inverse proportion to the approximate number of legal moves in a typical position, to a value of $\alpha = \{0.3, \; 0.15, \; 0.03\}$ for chess, shogi and Go respectively.

(AlphaZero)

Two things I don't understand:

P(s, a) is an $n$-dimensional vector. Is $\text{Dir}(\alpha)$ shorthand for the Dirichlet distribution with $n$ parameters, each with value $\alpha$?
I've only come across Dirichlet as the conjugate prior of the multinomial distribution. Why was it picked here?

For context, P(s, a) is just one component of the PUCT (polynomial upper confidence tree, a variant on upper confidence bounds) calculation for a given state/action. It's scaled by a constant and a metric for how many times the given action has been selected amongst its siblings during MCTS, and added to the estimated action value Q(s, a):

PUCT(s, a) = Q(s, a) + U(s, a).
$ U(s,a) = c_{\text{puct}} P(s,a) \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} $.

Best Answer

Question 1 is straightforward, here $\alpha$ is a vector of repetitions of the given value. (As answered by Max S.)

Question 2 is more interesting: The Dirichlet distribution has the following interpretation relevant in this context: When $\alpha$ is the observed vector of outcome-counts drawn from some (unknown) categorical distribution with outcome probabilities $\pi$, then $Dir(\alpha)(\pi)$ is the likelihood that $Cat(\pi)$ is the actual underlying distribution given you observed $\alpha$ as the counts. (This is basically the definition of a dual distribution.)

Now P(s,a) estimates the probability that a good player would play a in s, that is the parameters of his categorical distribution, which AlphaZero wants to learn. So $Dir(\alpha)$ would sample reasonable estimates for $pi=$P(s,a) if we observed a good player play moves $\alpha$-times. But if some $\alpha_i=0$, then all $\pi\sim Dir(\alpha)$ have $\pi_i=0$, preventing exploration. By adding the noise they assume that they have observed every move being played some small number of times $\alpha$ (here chosen 0.3, 0.15, 0.03).

As for how they got the constants, my guess is that they assume to have observed ~10 random plays in every game: In chess, $Dir(0.3)$ assumes that you have seen each move played 0.3 times. Given that there are ~35 moves available according to Allis, the authors assume you have seen ~10 random moves in every node. In Go, if we assume ~270 legal moves on average (3/4 of 361 board positions), we see an equivalent to observing ~8 random moves. (I do not have the data for Shogi.)

Best Answer

Related Solutions

Solved – the Dirichlet equivalent of a Beta (1,1) distribution

Related Question