Doubt About Counterfactual Regret Derivation

game theory

Counterfactual regret minimization is currently an important component of state-of-the-art poker-playing algorithms. It was introduced in 2007 in "Regret Minimization in Games with Incomplete Information" by Zinkevich et al.

I'm having trouble verifying one of the crucial results, Theorem 3, which I will restate after giving some definitions.

The context is a perfect recall extensive-form game of incomplete information, repeated for $T$ iterations.
At iteration $t$, we have the strategy profile $\sigma^t$.
$\sigma_i$ denotes the strategy of player $i$ and $\sigma_{-i}$ denotes the strategy of all players except for $i$.
$\Sigma_i$ is the space of behavioral strategies available to player $i$.
$\sigma|_{I\rightarrow a}$ denotes a strategy identical to $\sigma$ except that at information set $I$ the acting player always chooses action $a$.
$u_i(\sigma)$ is the expected payoff for player $i$ under strategy profile $\sigma$.
$A(I)$ are the actions available to the acting player at information set $I$.
$\frak{I}_i$ denotes the information sets of player $i$.
$Z$ is the set of terminal game states.
$\pi^\sigma(h)$ is the probability of reaching game state $h$ under the strategy profile $\sigma$ and $\pi^\sigma(h, h')$ is the probability of going from game state $h$ to $h'$.
$\pi^\sigma(I)$ is the probability that information set $I$ is reached under the strategy profile $\sigma$.
Probabilities $\pi^\sigma$ can be further decomposed into the contribution of player $i$, denoted $\pi^\sigma_i(\cdot)$, and the contribution of all players other than $i$, denoted $\pi^\sigma_{-i}(\cdot)$.
$x^+$ denotes $\max(x,0)$.

Define average overall regret of player i at time $T$ as:
$$ R^T_i = \frac 1 T \max_{\sigma_i^* \in \Sigma_i} \sum_{t=1}^T \left( u_i(\sigma_i^*, \sigma_{-i}^t) - u_i(\sigma^t) \right)$$
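For intuition, here is a minimal Python sketch (not from the paper) that evaluates this definition on a toy repeated matrix game, rock-paper-scissors, with a made-up two-iteration history. Since the objective is linear in $\sigma_i^*$, the inner maximum is attained at a pure action, so the code only searches over pure actions.

```python
import numpy as np

# Row player's payoff matrix for rock-paper-scissors (toy example).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

# Hypothetical play history: each player's mixed strategy at iterations t = 1, 2.
p1 = [np.array([1.0, 0.0, 0.0]), np.array([0.5, 0.5, 0.0])]
p2 = [np.array([0.0, 1.0, 0.0]), np.array([1/3, 1/3, 1/3])]

T = len(p1)
# Realized payoffs: sum over t of u_1(sigma^t).
realized = sum(x @ A @ y for x, y in zip(p1, p2))
# Best fixed deviation: max over pure actions a of sum over t of u_1(a, sigma_2^t).
deviation = max(sum((A @ y)[a] for y in p2) for a in range(3))

R_T = (deviation - realized) / T
print("average overall regret:", R_T)  # 1.0 for this history
```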

Define counterfactual utility of information set $I$ under strategy profile $\sigma$ as:
$$ u_i(\sigma, I) = \frac {\sum_{h\in I, h'\in Z} \pi^\sigma_{-i}(h)\pi^\sigma(h, h')u_i(h')}{\pi^\sigma_{-i}(I)}$$
Remark: The above seems to be equivalent to the expected payoff to player $i$ under $\sigma$, given that play reaches information set $I$. Just multiply numerator and denominator by $\pi^\sigma_i(I)$.
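To spell the remark out: under perfect recall, $\pi^\sigma_i(h)$ takes the same value, call it $\pi^\sigma_i(I)$, for every $h \in I$, and $\pi^\sigma(h) = \pi^\sigma_i(h)\pi^\sigma_{-i}(h)$, so
$$ u_i(\sigma, I) = \frac{\sum_{h\in I, h'\in Z} \pi^\sigma_i(I)\,\pi^\sigma_{-i}(h)\,\pi^\sigma(h, h')\,u_i(h')}{\pi^\sigma_i(I)\,\pi^\sigma_{-i}(I)} = \frac{\sum_{h\in I, h'\in Z} \pi^\sigma(h)\,\pi^\sigma(h, h')\,u_i(h')}{\pi^\sigma(I)},$$
which is the expected payoff to player $i$ given that play reaches $I$.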

Define immediate counterfactual regret at information set $I$ as:
$$ R^T_{i, \text{imm}}(I) = \frac 1 T \max_{a \in A(I)} \sum_{t = 1}^T \pi_{-i}^{\sigma^t}(I)\left(u_i(\sigma^t|_{I \rightarrow a}, I) - u_i(\sigma^t, I)\right)$$
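To make the formula concrete, here is a minimal Python sketch. It assumes you have already computed, for each iteration $t$, the opponents' reach probability $\pi_{-i}^{\sigma^t}(I)$ and the counterfactual utilities $u_i(\sigma^t|_{I\rightarrow a}, I)$ and $u_i(\sigma^t, I)$ (obtaining those requires a game-tree traversal that is not shown); the function name and arguments are hypothetical.

```python
def immediate_cf_regret(reach_neg_i, cf_value_action, cf_value):
    """Average immediate counterfactual regret at one information set I.

    reach_neg_i[t]        -- pi_{-i}^{sigma^t}(I), opponents' reach probability
    cf_value_action[t][a] -- u_i(sigma^t | I -> a, I)
    cf_value[t]           -- u_i(sigma^t, I)
    """
    T = len(reach_neg_i)
    actions = range(len(cf_value_action[0]))
    # For each action a, accumulate the reach-weighted regret over iterations,
    # then take the best action, exactly as in the definition above.
    return max(
        sum(reach_neg_i[t] * (cf_value_action[t][a] - cf_value[t])
            for t in range(T)) / T
        for a in actions
    )

# Two iterations, two actions at I (made-up numbers).
print(immediate_cf_regret([1.0, 0.5], [[1.0, 0.0], [2.0, 1.0]], [0.5, 1.5]))  # 0.375
```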

The authors claim Theorem 3:
$$R_i^T \le \sum_{I\in \frak{I}_i} R^{T,+}_{i, \text{imm}}(I)$$

Their proof lies in Appendix A.1 and requires some additional definitions.

$D(I)$ are the information sets of player $i$ reachable from $I$ and including $I$.
$\sigma|_{D(I)\rightarrow \sigma'}$ is a strategy profile equal to $\sigma$ except in the information sets $D(I)$ where it is equal to $\sigma'$.
$\text{succ}_i^\sigma(I'|I,a)$ is the probability that $I'$ is the next information set of player $i$, given that action $a$ was selected at information set $I$ and $\sigma$ is the current strategy. There is a mysterious qualification here:

"If $\sigma$ implies that $I$ is unreachable because of an action of player $i$, that action is changed to allow $I$ to be reachable."

$\text{Succ}_i(I,a)$ is the set of all possible next information sets of player $i$ given $a \in A(I)$ was selected in $I$. $\text{Succ}_i(I) = \cup_{a\in A(I)} \text{Succ}_i(I,a)$.

The full counterfactual regret is defined:
$$R_{i,\text{full}}^T(I) = \frac 1 T \max_{\sigma' \in \Sigma_i} \sum_{t=1}^T \pi_{-i}^{\sigma^t}(I)(u_i(\sigma^t|_{D(I)\rightarrow\sigma'}, I) – u_i(\sigma^t,I))$$

The authors assert equation $(10)$:
$$R_{i,\text{full}}^T(I) = \frac 1 T \max_{a \in A(I)} \max_{\sigma' \in \Sigma_i} \sum_{t=1}^T \pi_{-i}^{\sigma^t}(I)\Big[u_i(\sigma^t|_{I\rightarrow a},I) - u_i(\sigma^t, I) + \sum_{I'\in\text{Succ}_i(I,a)} \text{succ}_i^{\sigma^t}(I'|I,a)\big(u_i(\sigma^t|_{D(I)\rightarrow\sigma'},I') - u_i(\sigma^t,I')\big)\Big]$$

I'm just not getting the decomposition here. How can we split off a pure move $a \in A(I)$ when the maximization takes place over mixed strategies? Secondly, aren't we double-counting here? For instance, the term $-u_i(\sigma^t, I)$ already incorporates game-tree nodes that occur after taking action $a$ at $I$, yet we still add in the terms $-u_i(\sigma^t, I')$ for every information set $I'$ that is a direct descendant of $I$ after taking action $a$.

Can someone clear up this step for me?

Best Answer

We can derive equation (10) as follows.

$$R_{i,\text{full}}^T(I) = \frac 1 T \max_{\sigma' \in \Sigma_i} \sum_{t=1}^T \pi_{-i}^{\sigma^t}(I)(u_i(\sigma^t|_{D(I)\rightarrow\sigma'}, I) - u_i(\sigma^t,I)) $$

Lemma 1.
Let $\sigma'$ be a strategy for player $i$ whose move at $I$ is the pure move $a$. Then $$u_i(\sigma^t|_{D(I)\rightarrow\sigma'}, I) = \sum_{I'\in\text{Succ}_i(I,a)}\text{succ}_i^{\sigma^t}(I'|I,a)\,u_i(\sigma^t|_{D(I)\rightarrow\sigma'},I')$$

This is an instance of the law of total expectation, $E[X] = \sum_j P(A_j)\,E[X \mid A_j]$, conditioning on which information set of player $i$ is reached next after the pure move $a$ at $I$.

Lemma 2. $$ u_i(\sigma^t|_{I\rightarrow a},I) = \sum_{I'\in\text{Succ}_i(I,a)}\text{succ}_i^{\sigma^t}(I'|I,a)u_i(\sigma^t, I')$$

Again this follows from the law of total expectation, combined with the observation that once play has reached $I'$, the strategies $\sigma^t|_{I\rightarrow a}$ and $\sigma^t$ behave identically from then on (perfect recall guarantees $I$ is never revisited), so $u_i(\sigma^t|_{I\rightarrow a}, I') = u_i(\sigma^t, I')$.

Using Lemmas 1 and 2 we can derive equation (10). Fix $a \in A(I)$ and a strategy $\sigma' \in \Sigma_i$ whose move at $I$ is the pure move $a$, and focus on the terms inside the sum $\sum_{t=1}^T \pi^{\sigma^t}_{-i}(I)(\dots)$: $$u_i(\sigma^t|_{D(I)\rightarrow\sigma'}, I) - u_i(\sigma^t,I) + 0 =$$

$$u_i(\sigma^t|_{D(I)\rightarrow\sigma'}, I) - u_i(\sigma^t,I) + u_i(\sigma^t|_{I\rightarrow a},I) - u_i(\sigma^t|_{I\rightarrow a},I)=$$

$$u_i(\sigma^t|_{I\rightarrow a},I) - u_i(\sigma^t,I) + u_i(\sigma^t|_{D(I)\rightarrow\sigma'}, I) - u_i(\sigma^t|_{I\rightarrow a},I)=$$

$$u_i(\sigma^t|_{I\rightarrow a},I) - u_i(\sigma^t,I) + \sum_{I'\in\text{Succ}_i(I,a)}\text{succ}_i^{\sigma^t}(I'|I,a)\,u_i(\sigma^t|_{D(I)\rightarrow\sigma'},I') - \sum_{I'\in\text{Succ}_i(I,a)}\text{succ}_i^{\sigma^t}(I'|I,a)\,u_i(\sigma^t, I')=$$

$$u_i(\sigma^t|_{I\rightarrow a},I) - u_i(\sigma^t,I) + \sum_{I'\in\text{Succ}_i(I,a)}\text{succ}_i^{\sigma^t}(I'|I,a)\left(u_i(\sigma^t|_{D(I)\rightarrow\sigma'},I') - u_i(\sigma^t, I')\right)$$

This is exactly the bracketed expression in equation (10), so maximizing over $a \in A(I)$ and over $\sigma'$ whose move at $I$ is the pure move $a$ yields (10). The restriction to a pure move at $I$ loses nothing: the objective is linear in the probabilities $\sigma'$ assigns to the actions at $I$, so the maximum over mixed strategies is attained at some pure action, which answers your first question. As for the double-counting worry: the terms $-u_i(\sigma^t, I')$ are not redundant with $-u_i(\sigma^t, I)$; by Lemma 2 they sum to exactly $-u_i(\sigma^t|_{I\rightarrow a}, I)$, the quantity added and subtracted in the second step, so nothing is counted twice.
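As a sanity check on the algebra, here is a small Python sketch with made-up numbers. It constructs the successor quantities so that Lemmas 1 and 2 hold by construction (assuming, for simplicity, that the successor probabilities sum to 1) and verifies that the starting expression equals the final one.

```python
import random

random.seed(0)

n_succ = 3  # number of successor information sets I' in Succ_i(I, a)
raw = [random.random() for _ in range(n_succ)]
succ = [x / sum(raw) for x in raw]                 # succ_i^{sigma^t}(I'|I,a)
u_full_succ = [random.uniform(-1, 1) for _ in range(n_succ)]   # u_i(sigma^t|D(I)->sigma', I')
u_sigma_succ = [random.uniform(-1, 1) for _ in range(n_succ)]  # u_i(sigma^t, I')
u_sigma_I = random.uniform(-1, 1)                  # u_i(sigma^t, I)

# Lemma 1: expand u_i(sigma^t|D(I)->sigma', I) over successors.
u_full_I = sum(p * v for p, v in zip(succ, u_full_succ))
# Lemma 2: expand u_i(sigma^t|I->a, I) over successors.
u_a_I = sum(p * v for p, v in zip(succ, u_sigma_succ))

lhs = u_full_I - u_sigma_I
rhs = (u_a_I - u_sigma_I
       + sum(p * (vf - vs)
             for p, vf, vs in zip(succ, u_full_succ, u_sigma_succ)))

assert abs(lhs - rhs) < 1e-12
print("decomposition holds:", lhs, "==", rhs)
```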

This derivation was discussed on Reddit's r/gametheory.
