Solved – Is the policy function $\pi$ in Reinforcement learning a random variable

reinforcement learning

I was reading Andrew Ng's lecture notes on reinforcement learning and on page 3 he defines the value function:

$$V^{\pi}(s) = E[R(s_0) + \gamma R(s_1)+\gamma^2 R(s_2) + … |s_0 = s, \pi]$$

This means the expected total payoff, given that we start in state $s$ and execute policy $\pi$. However, in his footnotes he says that this notation is a little "sloppy", because $\pi$ isn't technically a random variable (even though the notation implies that both $\pi$ and $s$ are random variables; it makes sense that $s$ is a r.v., since we don't always know for sure what state we will end up in after taking some action $a$).

My question is: why isn't $\pi$ a random variable? If it's not a random variable, does that mean that in reinforcement learning we are just looking for some policy that is the "best" and was somehow chosen by "nature"? Does it mean that we are not allowed to have some prior belief about which $\pi$ might be true? Is reinforcement learning, or at least the value function here, restricted to a frequentist point of view?

Would a better notation for that equation be:

$$V^{\pi}(s) = E[R(s_0) + \gamma R(s_1)+\gamma^2 R(s_2) + … |s_0 = s; \pi]$$

These were my thoughts so far:

$\pi$ is the policy function; it's a function that maps states deterministically to actions, $\pi(s) = a$. However, I don't really see why reinforcement learning has to be restricted to a frequentist interpretation. It seemed reasonable to me that $\pi$ could be a r.v. and we instead try to execute the expected policy over all policies, or something along those lines (I am not trying to make this idea too precise, but hopefully the concept makes sense). Is it just that Andrew Ng is introducing the concepts of reinforcement learning from a frequentist point of view first, as it might be the easiest to understand?
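
To make my current picture concrete, here is a rough sketch (the toy MDP, its dynamics, and all names are made up purely for illustration) of a Monte Carlo estimate of $V^{\pi}(s)$. The point is that $\pi$ enters only as a fixed function argument of the estimator; the averaging is over the random trajectory $s_0, s_1, s_2, \dots$, never over $\pi$ itself:

```python
import random

# Hypothetical toy MDP: states 0..3, actions "left"/"right".
# The dynamics, reward, and names here are invented purely for illustration.

def R(state):
    """Reward depends only on the state, as in the notes' R(s_t)."""
    return 1.0 if state == 3 else 0.0

def step(state, action):
    """Noisy transition: this is why s_1, s_2, ... really are random variables."""
    move = 1 if action == "right" else -1
    if random.random() < 0.1:   # 10% chance the environment "slips"
        move = -move
    return min(3, max(0, state + move))

def pi(state):
    """A deterministic policy: just a fixed function from states to actions."""
    return "right"

def estimate_value(pi, s, gamma=0.9, horizon=50, n_rollouts=10_000):
    """Monte Carlo estimate of V^pi(s) = E[sum_t gamma^t R(s_t) | s_0 = s; pi].
    pi appears only as a fixed argument; the expectation is approximated by
    averaging the discounted return over many random rollouts."""
    total = 0.0
    for _ in range(n_rollouts):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(horizon):
            ret += discount * R(state)
            discount *= gamma
            state = step(state, pi(state))
        total += ret
    return total / n_rollouts

print(estimate_value(pi, s=0))
```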

Best Answer

This has nothing to do with frequentism. When the policy $\pi$ defines a distribution over actions, it is called a stochastic policy.

Originally, policies were not stochastic, since they were defined as mapping each state to the highest-value action. The actual policy that is followed in, say, an $\epsilon$-greedy approach is to disobey that deterministic policy and act randomly with probability $\epsilon$.
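
As a minimal sketch of the distinction (the Q-values and action names are invented for illustration): a deterministic greedy policy returns a single action, while a stochastic $\epsilon$-greedy policy defines a probability distribution over actions and samples from it.

```python
import random

# Invented Q-values for a single state, just for illustration.
Q = {"left": 0.2, "right": 0.7, "stay": 0.1}

def greedy_policy(Q):
    """Deterministic policy: maps the (implicit) state to one action."""
    return max(Q, key=Q.get)

def epsilon_greedy_distribution(Q, epsilon=0.1):
    """Stochastic policy: a probability distribution over actions."""
    n = len(Q)
    best = max(Q, key=Q.get)
    return {a: (1 - epsilon) + epsilon / n if a == best else epsilon / n
            for a in Q}

def sample_action(dist):
    """Acting under a stochastic policy means sampling from its distribution."""
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(greedy_policy(Q))                  # always "right"
dist = epsilon_greedy_distribution(Q)
print(dist)                              # most mass on "right", epsilon spread over the rest
print(sample_action(dist))               # usually "right", occasionally exploratory
```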
