[Math] Cross-Entropy loss in Reinforcement Learning

machine learning

In the context of supervised learning for classification with neural networks, one way to evaluate the performance of a model is the cross-entropy loss, given by:
$$
L = -\sum_{i=1}^{n} \log\left(\pi(f(x_i))_{y_i}\right)
$$
Where $x_i$ is a vector datapoint, $\pi$ is the softmax function, $f$ is our neural network, and $y_i$ is the correct class for that datapoint. So the result is that we get the $\log$ of the normalised network output for the correct class. The smaller $L$ is, the more accurate our model is. So we can view this as a minimisation problem, where we just need to minimise $L$ with respect to the network weights. However, if we now look at how this equation is used in reinforcement learning, where the network output represents an action chosen to interact with an environment instead of a class, we get something like this:
$$
L = -\sum_{i=1}^{n} r_i \log\left(\pi(f(x_i))_{a_i}\right)
$$
Where $r_i$ is a scalar representing the reward at a time step, and $a_i$ is the corresponding chosen action. Since we are now in a reinforcement learning setting, $a_i$ is used because we don't know what the correct action is, but by weighting each term by the reward, the gradient is pushed in the right direction when we backpropagate through the network.
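To make the two formulas concrete, here is a minimal NumPy sketch of both losses. The logits, labels, actions, and rewards are made-up placeholders, and `softmax` is just a small helper for illustration:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical network outputs f(x_i) for a batch of 3 datapoints, 4 classes/actions.
logits = np.array([[ 2.0, 0.5, -1.0,  0.1],
                   [ 0.2, 1.5,  0.3, -0.7],
                   [-0.4, 0.0,  2.2,  1.1]])
probs = softmax(logits)  # pi(f(x_i))

# Supervised case: y_i is the correct class index for each datapoint.
y = np.array([0, 1, 2])
ce_loss = -np.sum(np.log(probs[np.arange(len(y)), y]))

# RL case: a_i is the sampled action, r_i the reward received at that step.
a = np.array([0, 3, 2])
r = np.array([1.0, -0.5, 2.0])
pg_loss = -np.sum(r * np.log(probs[np.arange(len(a)), a]))

print(ce_loss, pg_loss)
```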

Although this is how the difference between the equations has been explained to me, there is one thing that doesn't quite add up. In the reinforcement learning version of the equation, $L$ now increases when the agent performs well and decreases when the agent performs badly (where a negative reward means punishment). This makes me think it has become a maximisation problem; however, all the implementations I've seen still treat the equation as a minimisation problem. Why is that? Have I misunderstood the maths?

Best Answer

In the reinforcement learning setting, you are trying to maximize the expected reward under the policy. When you take the gradient in a stochastic, sampled setting, this corresponds to maximizing the log probability of the action you actually took, weighted by the return/reward you received (see the REINFORCE algorithm, policy gradients, and score-function estimators in general).
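Concretely, writing $\pi_\theta(a)$ for the probability the policy assigns to action $a$ and $R(a)$ for the return received (notation introduced here for illustration), the score-function identity that REINFORCE relies on is:
$$
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\!\left[R(a)\right]
= \sum_a R(a)\, \nabla_\theta \pi_\theta(a)
= \sum_a R(a)\, \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)
= \mathbb{E}_{a \sim \pi_\theta}\!\left[R(a)\, \nabla_\theta \log \pi_\theta(a)\right]
$$
So for a sampled action $a_i$ with reward $r_i$, the quantity $r_i \nabla_\theta \log \pi(f(x_i))_{a_i}$ is an unbiased estimate of the gradient of the expected reward, which is exactly the term (up to the minus sign) appearing in your loss.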

Note that the cross-entropy loss has a negative sign in front. Take the negative sign away and maximize instead of minimizing: you are then maximizing the log probability of the action times the reward, which is exactly what you want.

So, minimizing the cross-entropy loss is equivalent to maximizing the probability of the target under the learned distribution.
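As a concrete illustration of "keep the minus sign and minimize", here is a minimal PyTorch-style sketch of a REINFORCE-like update; the tiny policy network, states, actions, and returns are all made-up placeholders. The optimizer minimizes $-\sum_i r_i \log \pi(f(x_i))_{a_i}$, which is the same as maximizing $\sum_i r_i \log \pi(f(x_i))_{a_i}$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical tiny policy network: 4-dimensional state -> 3 actions.
policy = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 3))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# Placeholder batch of states, sampled actions, and returns/rewards.
states  = torch.randn(5, 4)
actions = torch.randint(0, 3, (5,))
returns = torch.tensor([1.0, 0.5, -1.0, 2.0, 0.0])

logits    = policy(states)
log_probs = F.log_softmax(logits, dim=-1)                   # log pi(f(x_i))
chosen    = log_probs[torch.arange(len(actions)), actions]  # log pi(f(x_i))_{a_i}

# Minimizing this loss is identical to maximizing sum(r_i * log pi(a_i)).
loss = -(returns * chosen).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Flipping the sign and calling a maximizer would give exactly the same update; frameworks simply default to minimization, which is why implementations keep the negative sign in the loss.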
