Machine Learning – Implementing Policy Gradient Reinforcement Learning in Continuous State and Action Space

artificial intelligence, machine learning, neural networks

I am a novice in the field of machine learning. I have a moderate understanding of linear/non-linear regression, support vector machines, neural networks, and Q-learning (for discrete, finite state and action spaces). Recently, I was reading the paper "User Scheduling and Resource Allocation in HetNets With Hybrid Energy Supply: An Actor-Critic Reinforcement Learning Approach", published in IEEE Transactions on Wireless Communications, vol. 17, no. 1.

The paper is about actor-critic reinforcement learning, but what I cannot understand is the "policy gradient" part. Specifically, my action space is continuous, and a vector such as $[b_1, b_2, \ldots, b_n, c_1, c_2, \ldots, c_n]$ counts as one action, where $b_1, \ldots, b_n, c_1, \ldots, c_n$ are all continuous variables; the same goes for the state space. In the paper mentioned above, the authors consider a Gaussian policy:

$$\pi_{\theta}(s,a)=\dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\dfrac{(a-\mu(s))^2}{2\sigma^2}\right) \tag{1}$$

Although $\mu(s)$ will be a scalar quantity, $a$ must be a vector, since each action is a vector $[b_1, b_2, \ldots, b_n, c_1, c_2, \ldots, c_n]$. Is (1) a multi-dimensional Gaussian distribution?

Further, how do I find the state distribution required for the policy gradient? If the state space is continuous, shouldn't the probability of being in any particular state be $0$? If not, how can I use the Gaussian policy to find the state distribution? Can someone explain the policy gradient update expression? Am I required to take samples from the continuous state and action spaces for the update? Could someone just work through a single iteration for me?

I know that from my questions it may seem that I haven't tried to find the answers myself and am just trying to dump my burden on someone else. However, trust me, I have read many tutorials and papers and watched many online lectures on YouTube, but I am still confused, mainly because most tutorials assume that the reader already knows how to find $\nabla_{\theta} \pi_{\theta}(s,a)$, how to find the state distribution, etc.

Best Answer

First of all, actor-critic is an advanced type of policy gradient algorithm. The most basic form of policy gradient is called REINFORCE. Since most things in reinforcement learning build on each other, it might be useful to understand the simple form first. A very good explanation of policy gradients can be found in Chapter 13 of "Reinforcement Learning: An Introduction" by Sutton and Barto (available on the authors' website: http://incompleteideas.net/book/the-book-2nd.html). The following explanations assume a general understanding of reinforcement learning. For more information, please refer to the book or the link at the end of this answer.

In policy gradient methods, you have a neural network (usually) that approximates the policy (i.e. "which action should be taken in which state?"). Concretely, this means that given a state $s$, the network predicts a probability distribution $\pi(a|s)$ (i.e. "With what probability should action $a$ be taken in state $s$?"). If you have a discrete set of actions this could be a neural network with a softmax output layer.
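
For concreteness, here is a minimal sketch of such a softmax policy in Python. To keep it short, it uses a single linear layer in plain NumPy instead of a real neural network, and the feature and action sizes are made up for illustration:

```python
import numpy as np

# Minimal sketch of a softmax policy over a discrete action set.
# A single linear layer stands in for the neural network (an assumption
# made purely to keep the example short).

rng = np.random.default_rng(0)

n_state_features = 4
n_actions = 3
W = rng.normal(scale=0.1, size=(n_actions, n_state_features))  # policy parameters theta

def policy(state):
    """Return pi(a|s): one probability per discrete action."""
    logits = W @ state
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

state = rng.normal(size=n_state_features)
probs = policy(state)
action = rng.choice(n_actions, p=probs)         # sample the action to execute
```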

If you have continuous actions however (like in your case), you can use a Gaussian policy. Let's for now assume that your actions are scalars (i.e. 1-dimensional). Here, the network cannot output the probabilities for all actions directly (they are continuous, so there are infinitely many), but instead outputs the parameters of the Gaussian policy: the mean $\mu(s)$ and (sometimes) the standard deviation $\sigma(s)$. Equation (1) then tells you how likely different actions should be (see the definition of a normal distribution). To execute this policy, you sample an action from the Gaussian distribution defined by (1). You can do the same for $d$-dimensional actions if you move to $d$ Gaussians (or a $d$-dimensional multivariate Gaussian distribution with a diagonal covariance matrix, which means that all $d$ one-dimensional Gaussian distributions are independent of each other).
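
Here is a rough sketch of how such a diagonal Gaussian policy could look for $d$-dimensional actions. The "network" is again just a linear map from state features, which is an assumption for brevity; in practice $\mu(s)$ and $\log\sigma(s)$ would be two output heads of the policy network:

```python
import numpy as np

# Sketch of a Gaussian policy with diagonal covariance for d-dimensional
# continuous actions: one independent 1-D Gaussian per action dimension.
# Linear maps stand in for the network heads (an assumption for brevity).

rng = np.random.default_rng(0)

n_state_features = 4
d = 2                                            # action dimension
W_mu = rng.normal(scale=0.1, size=(d, n_state_features))
W_log_sigma = np.zeros((d, n_state_features))

def gaussian_policy(state):
    mu = W_mu @ state                            # mean of each action dimension
    sigma = np.exp(W_log_sigma @ state)          # std; exp keeps it positive
    return mu, sigma

state = rng.normal(size=n_state_features)
mu, sigma = gaussian_policy(state)
action = rng.normal(mu, sigma)                   # sample each dimension independently
```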

Policy gradient methods learn the parameters of the neural network. You do this by playing a game/episode with your current policy. That is, you start in some state $s_0$ given by the environment. You then predict $\pi(a|s_0)$ using the policy neural network, sample action $a_0$ from that probability distribution, and execute it. The environment gives you reward $r_1$ and you end up in state $s_1$. You repeat this until you reach a terminal state. Now you can update your network parameters $\theta \leftarrow \theta + \alpha G \nabla\ln\pi(a_t|s_t)$ for all time steps $t$. Note that $\alpha$ is the learning rate and $G$ is the return, i.e. the amount of reward you expect to gain in the future after performing action $a_t$ in state $s_t$. This can either be calculated from the episode you just played or estimated by the critic in actor-critic (note that actor-critic works slightly differently and has additional updates for learning the critic; please see the book chapter for more details).
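
To make the loop concrete, here is a minimal REINFORCE sketch with a scalar Gaussian action and linear function approximation. The toy environment, features, and hyperparameters are all invented for illustration; this is not the setup from the paper you cite:

```python
import numpy as np

# Minimal REINFORCE sketch: scalar Gaussian policy, linear function
# approximation. The "environment" is a made-up 1-D task: the state is a
# point on a line, the continuous action moves it, and the reward is higher
# the closer we end up to the origin.

rng = np.random.default_rng(0)

def features(s):
    # Hand-crafted state features x(s); an assumption for this toy example.
    return np.array([1.0, s, s**2])

def step(s, a):
    # Toy dynamics and reward (assumed purely for illustration).
    s_next = float(np.clip(s + a, -3.0, 3.0))    # keep the toy state bounded
    reward = -s_next**2
    done = abs(s_next) < 0.05
    return s_next, reward, done

alpha = 1e-3               # learning rate
gamma = 0.99               # discount factor
theta_mu = np.zeros(3)     # parameters of the mean:    mu(s)    = theta_mu @ x(s)
theta_sigma = np.zeros(3)  # parameters of the log-std: ln sigma = theta_sigma @ x(s)

for episode in range(200):
    s = rng.uniform(-2, 2)
    trajectory = []                              # (state, action, reward) tuples
    for t in range(50):                          # cap the episode length
        x = features(s)
        mu = theta_mu @ x
        sigma = np.exp(theta_sigma @ x)
        a = rng.normal(mu, sigma)                # sample from the Gaussian policy
        s, r, done = step(s, a)
        trajectory.append((x, a, r))
        if done:
            break

    # REINFORCE update: theta <- theta + alpha * G * grad ln pi(a_t | s_t)
    # (the book's version also multiplies by gamma**t; omitted here to match
    # the simplified update written above).
    G = 0.0
    for x, a_t, r_t in reversed(trajectory):
        G = r_t + gamma * G                      # return from time step t onwards
        mu = theta_mu @ x
        sigma = np.exp(theta_sigma @ x)
        grad_mu = (a_t - mu) / sigma**2 * x                  # d ln pi / d theta_mu
        grad_sigma = ((a_t - mu)**2 / sigma**2 - 1.0) * x    # d ln pi / d theta_sigma
        theta_mu += alpha * G * grad_mu
        theta_sigma += alpha * G * grad_sigma
```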

Note that we haven't yet seen how to calculate the gradient $\nabla\ln\pi(a_t|s_t)$. That is, we need to know the gradient of the Gaussian policy to update the network. The book comes to our rescue again: Chapter 13.7 and exercise 13.4 (at the end of Chapter 13.7) give more details on how to perform policy gradient methods with continuous action spaces and how to calculate the gradient given the parameters and the sampled action.
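
For reference, here is roughly what that calculation looks like under the parameterization used in the book, where the mean is linear and the standard deviation is exponential-linear in feature vectors $\mathbf{x}_\mu(s)$ and $\mathbf{x}_\sigma(s)$ (deriving this yourself is essentially exercise 13.4):

$$\mu(s) = \theta_\mu^\top \mathbf{x}_\mu(s), \qquad \sigma(s) = \exp\!\left(\theta_\sigma^\top \mathbf{x}_\sigma(s)\right)$$

$$\ln\pi_\theta(a|s) = -\frac{(a-\mu(s))^2}{2\sigma(s)^2} - \ln\sigma(s) - \tfrac{1}{2}\ln(2\pi)$$

$$\nabla_{\theta_\mu}\ln\pi_\theta(a|s) = \frac{a-\mu(s)}{\sigma(s)^2}\,\mathbf{x}_\mu(s), \qquad \nabla_{\theta_\sigma}\ln\pi_\theta(a|s) = \left(\frac{(a-\mu(s))^2}{\sigma(s)^2} - 1\right)\mathbf{x}_\sigma(s)$$

These are exactly the quantities you plug into the update $\theta \leftarrow \theta + \alpha G \nabla\ln\pi(a_t|s_t)$ above: one gradient for the mean parameters and one for the standard deviation parameters.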

"Spinning Up" by OpenAI might be another good resource for reading up on reinforcement learning and policy gradient methods: https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

I hope that this answer gives some context to your questions and guides you in learning more. Sadly, there is not enough space here to explain every detail.
