Solved – How to correctly compute $\rho$ in reinforcement learning with importance sampling

importance-sampling, q-learning, reinforcement learning

This question regards off-policy reinforcement learning using importance sampling.

From the paper Off-Policy Temporal Difference Learning with Function Approximation, page 3:

$\rho_t = \frac{\pi(s_t,a_t)}{b(s_t,a_t)}$

where $\rho_t$ is the importance sampling ratio for time $t$, which multiplies the step-size parameter $\alpha$ (the learning rate) in the parameter update.
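In code terms, I picture the update looking roughly like this (my own sketch of a per-step, linear TD(0)-style update with importance sampling, not the exact algorithm from the paper; all names and numbers are mine):

```python
import numpy as np

def importance_sampling_ratio(pi_prob: float, b_prob: float) -> float:
    # rho_t = pi(s_t, a_t) / b(s_t, a_t); b_prob must be strictly positive.
    return pi_prob / b_prob

def td0_update(theta, phi_s, phi_s_next, reward, rho, alpha=0.1, gamma=0.99):
    # One off-policy update of a linear value estimate theta @ phi(s):
    # the TD error is scaled by the step size alpha and by rho_t.
    td_error = reward + gamma * phi_s_next @ theta - phi_s @ theta
    return theta + alpha * rho * td_error * phi_s

# Made-up usage with a 4-dimensional feature vector and made-up probabilities:
theta = np.zeros(4)
phi_s, phi_s_next = np.array([1., 0., 0., 0.]), np.array([0., 1., 0., 0.])
rho = importance_sampling_ratio(pi_prob=1.0, b_prob=0.925)
theta = td0_update(theta, phi_s, phi_s_next, reward=1.0, rho=rho)
```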

From what I understand, $\pi(s_t,a_t)$ is the probability of selecting action $a_t$ in state $s_t$ according to the target policy $\pi$. Assuming that $\pi$ is the greedy policy (and here I'm probably wrong), wouldn't this value always be either 0 (exploratory action) or 1 (greedy action)? Wouldn't that mean the algorithm never learns from exploratory actions?

It seems I don't quite understand how to compute $\rho$ because I don't understand what that ratio actually represents. How are $\pi(s_t,a_t)$ and $b(s_t,a_t)$ (and consequently $\rho$) actually computed?

Best Answer

The answer is tucked in the abstract (emphasis mine): "We prove that, given training under any $\epsilon$-soft policy [...] to the action-value function for an arbitrary target policy."

The authors also write, "We always use $\pi$ to denote the target policy, the policy that we are learning about."

Epsilon-soft is, as defined here, synonymous with epsilon-greedy:

An epsilon-soft policy chooses one action with probability $1 - \epsilon + \frac{\epsilon}{|A(s)|}$ and all other actions with probability $\frac{\epsilon}{|A(s)|}$, where $|A(s)|$ is the number of actions in state $s$. This is equivalent to saying that the policy is to choose a random action with some small probability epsilon, and another single (dominant) action otherwise.

Wiki defines epsilon-greedy similarly:

One such method is $\epsilon$-greedy, when the agent chooses the action that it believes has the best long-term effect with probability $1-\epsilon$, and it chooses an action uniformly at random, otherwise.

That is, there are no restrictions on the target policy $\pi$ (it may well be the greedy policy); it is the behavior policy $b$ that must be epsilon-soft.
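To make that concrete, here is a minimal sketch under the assumption that $\pi$ is greedy and $b$ is epsilon-greedy over the same tabular action-value estimates (the function names and numbers are mine, not from the paper):

```python
import numpy as np

def target_prob_greedy(q_values: np.ndarray, action: int) -> float:
    # pi(s, a) for a greedy target policy: 1 for the argmax action, 0 otherwise.
    return 1.0 if action == int(np.argmax(q_values)) else 0.0

def behavior_prob_eps_greedy(q_values: np.ndarray, action: int, epsilon: float) -> float:
    # b(s, a) for an epsilon-greedy (hence epsilon-soft) behavior policy:
    # the greedy action gets 1 - epsilon + epsilon/|A(s)|, every other
    # action gets epsilon/|A(s)|, so b(s, a) > 0 whenever epsilon > 0.
    base = epsilon / len(q_values)
    if action == int(np.argmax(q_values)):
        return base + (1.0 - epsilon)
    return base

# Example: 3 actions, Q(s, .) = [1.0, 2.0, 0.5], epsilon = 0.1
q, eps = np.array([1.0, 2.0, 0.5]), 0.1
for a in range(len(q)):
    rho = target_prob_greedy(q, a) / behavior_prob_eps_greedy(q, a, eps)
    print(a, round(rho, 3))
# greedy action 1: 1 / (0.9 + 0.1/3) ~= 1.071; exploratory actions 0 and 2: 0.0
```

Note that the zeros for exploratory actions come from the greedy target policy in the numerator; the epsilon-soft requirement on $b$ only guarantees that the denominator is never zero, so $\rho_t$ is always well defined.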
