Is a policy always deterministic in reinforcement learning?

deterministic-policy, reinforcement learning, stochastic-policy

In reinforcement learning, is a policy always deterministic, or is it a probability distribution over actions (from which we sample)? If the policy is deterministic, why isn't the value function, which is defined at a given state for a given policy $\pi$ as follows

$$V^{\pi}(s) = E\left[\sum_{t>0} \gamma^{t}r_t|s_0 = s, \pi\right]$$

a point output?

In the above definition, we take an expectation. What is this expectation over?

Can a policy lead to different routes?

Best Answer

There are multiple questions here:

1. Is a policy always deterministic?
2. If the policy is deterministic, then shouldn't the value also be deterministic?
3. What is the expectation over in the value function estimate?

Your last question, "Can a policy lead to routes that have different current values?", is not very clear, but I think you mean:

4. Can a policy lead to different routes?

1. A policy is a function that can be either deterministic or stochastic. It dictates what action to take given a particular state. A stochastic policy is a distribution $\pi(a\mid s)$ over actions given the state, and a deterministic policy is a mapping $\pi:S \rightarrow A$, where $S$ is the set of possible states and $A$ is the set of possible actions.
2. The value of a state is a single number, but the return it averages is not deterministic, which is why the definition takes an expectation. The value (of a state) is the expected discounted reward if you start at that state and continue to follow the policy. Even if the policy is deterministic, the reward function and the environment's transitions might not be.
3. The expectation in that formula is over all the possible routes (trajectories) starting from state $s$. Usually, the routes or paths are decomposed into multiple steps, which are used to train value estimators. Each step can be represented by the tuple $(s,a,r,s')$ (state, action, reward, next state).
4. This is related to answer 2: a policy, even a deterministic one, can lead to different paths because the environment is usually not deterministic.
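The points above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not part of the question: a hypothetical four-state chain (states 0 to 3, reward 1 on reaching state 3), a slip probability of 0.2, a discount factor of 0.9, and the names `deterministic_policy`, `stochastic_policy`, `step`, `rollout`, and `estimate_value`. The discounting convention (the reward from the step taken at time $t$ is weighted by $\gamma^t$) is also an assumption and may be shifted by one step relative to the formula in the question. The sketch shows a deterministic policy producing different routes in a stochastic environment, and a Monte Carlo average of the returns standing in for the expectation.

```python
import random

GAMMA = 0.9   # discount factor (assumed value)
GOAL = 3      # terminal state of a hypothetical chain 0 -> 1 -> 2 -> 3


def deterministic_policy(state, rng):
    """pi: S -> A. The same state always yields the same action (rng is unused)."""
    return "right"


def stochastic_policy(state, rng):
    """pi(a | s). The action is sampled from a distribution over actions."""
    return rng.choices(["right", "left"], weights=[0.8, 0.2])[0]


def step(state, action, rng, slip_prob=0.2):
    """Stochastic environment: with probability slip_prob the move fails."""
    if rng.random() < slip_prob:
        next_state = state
    elif action == "right":
        next_state = min(state + 1, GOAL)
    else:
        next_state = max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return reward, next_state


def rollout(policy, start, rng, max_steps=50):
    """Follow the policy from `start`; return the route and its discounted return."""
    state, route, ret = start, [start], 0.0
    for t in range(max_steps):
        if state == GOAL:
            break
        action = policy(state, rng)
        reward, state = step(state, action, rng)
        ret += (GAMMA ** t) * reward  # assumed convention: step t weighted by gamma^t
        route.append(state)
    return route, ret


def estimate_value(policy, start, episodes=5000, seed=0):
    """Monte Carlo estimate of V^pi(start): average return over sampled routes."""
    rng = random.Random(seed)
    returns = [rollout(policy, start, rng)[1] for _ in range(episodes)]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(3):
        # Same deterministic policy, yet the route can differ between episodes.
        print(rollout(deterministic_policy, 0, rng))
    print(estimate_value(deterministic_policy, 0))
    print(estimate_value(stochastic_policy, 0))
```

Each call to `rollout` samples one route and one return. The value estimate is a single number only because it averages over many such routes, which is the role the expectation plays in the definition.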
