Solved – Understanding the role of the discount factor in reinforcement learning

machine learning, reinforcement learning

I'm teaching myself about reinforcement learning, and trying to understand the concept of discounted reward. So the reward is necessary to tell the system which state-action pairs are good, and which are bad. But what I don't understand is why the discounted reward is necessary. Why should it matter whether a good state is reached soon rather than later?

I do understand that this is relevant in some specific cases. For example, if you are using reinforcement learning to trade in the stock market, it is more beneficial to make profit sooner rather than later. This is because having that money now allows you to do things with that money now, which is more desirable than doing things with that money later.

But in most cases, I don't see why the discounting is useful. For example, let's say you wanted a robot to learn how to navigate around a room to reach the other side, where there are penalties if it collides with an obstacle. If there was no discount factor, then it would learn to reach the other side perfectly, without colliding with any obstacles. It may take a long time to get there, but it will get there eventually.

But if we give a discount to the reward, then the robot will be encouraged to reach the other side of the room quickly, even if it has to collide with objects along the way. This is clearly not a desirable outcome. Sure, you want the robot to get to the other side quickly, but not if this means that it has to collide with objects along the way.
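To make my worry concrete, here is a toy calculation with made-up numbers (my own illustration), comparing a slow-but-safe path against a fast path that incurs one collision penalty:

```python
# Toy illustration (arbitrary numbers, not from a real robot task):
# compare the discounted return of a slow-but-safe path against a
# fast path with one collision penalty along the way.

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Safe path: 20 steps of 0 reward, then +10 for reaching the goal.
safe = [0.0] * 20 + [10.0]
# Fast path: one -5 collision penalty at step 2, goal (+10) at step 5.
fast = [0.0, 0.0, -5.0, 0.0, 0.0, 10.0]

for gamma in (1.0, 0.9):
    print(gamma, discounted_return(safe, gamma), discounted_return(fast, gamma))
```

With no discounting (gamma = 1.0) the safe path wins (10 vs 5); at gamma = 0.9 the fast, colliding path wins (about 1.85 vs 1.22), which is exactly the trade-off I am worried about.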

So my intuition is that any form of discount factor will actually lead to a sub-optimal solution. And the choice of the discount factor often seems arbitrary: many methods I have seen simply set it to 0.9. This appears very naive to me, and seems to impose an arbitrary trade-off between the optimal solution and the fastest solution, whereas in reality this trade-off is very important.

Please can somebody help me to understand all this? Thank you 🙂

TL;DR.

The fact that the discount rate is bounded to be smaller than 1 is a mathematical trick to make an infinite sum finite. This helps prove the convergence of certain algorithms.

In practice, the discount factor can be used to model the fact that the decision maker is uncertain about whether, at the next decision instant, the world (e.g., environment / game / process) is going to end.

For example:

If the decision maker is a robot, the discount factor could be the probability that the robot is switched off at the next time instant (the world ends, in the previous terminology). That is the reason why the robot is short-sighted and does not optimize the sum of rewards but the discounted sum of rewards.
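This interpretation can be sanity-checked numerically: if the process terminates at every step with probability one minus the discount factor, the expected *undiscounted* sum of a constant reward of 1 equals the discounted series 1/(1 - discount). A quick Monte Carlo sketch (my own illustration; `gamma` is the discount factor):

```python
# If at every step the process survives with probability gamma, the
# expected undiscounted sum of a constant reward of 1 should equal the
# discounted series 1/(1 - gamma). Monte Carlo check:
import random

random.seed(0)
gamma = 0.9

def episode_return():
    total, alive = 0.0, True
    while alive:
        total += 1.0                      # reward of 1 each step
        alive = random.random() < gamma   # survive with prob. gamma
    return total

estimate = sum(episode_return() for _ in range(100_000)) / 100_000
print(estimate, 1 / (1 - gamma))   # both close to 10
```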

Discount factor smaller than 1 (In Detail)

In order to answer more precisely why the discount rate has to be smaller than one, I will first introduce Markov Decision Processes (MDPs).

Reinforcement learning techniques can be used to solve MDPs. An MDP provides a mathematical framework for modeling decision-making situations where outcomes are partly random and partly under the control of the decision maker. An MDP is defined via a state space $$\mathcal{S}$$, an action space $$\mathcal{A}$$, a function of transition probabilities between states (conditioned on the action taken by the decision maker), and a reward function.

In its basic setting, the decision maker takes an action, gets a reward from the environment, and the environment changes its state. Then the decision maker senses the new state of the environment, takes an action, gets a reward, and so on. The state transitions are probabilistic and depend solely on the current state and the action taken by the decision maker. The reward obtained by the decision maker depends on the action taken, and on both the original and the new state of the environment.

A reward $$R_{a_i}(s_j,s_k)$$ is obtained when taking action $$a_i$$ in state $$s_j$$ and the environment/system changes to state $$s_k$$. The decision maker follows a policy $$\pi(\cdot):\mathcal{S}\rightarrow\mathcal{A}$$ that for each state $$s_j \in \mathcal{S}$$ selects an action $$a_i \in \mathcal{A}$$. The policy is what tells the decision maker which action to take in each state. The policy $$\pi$$ may be randomized as well, but that does not matter for now.
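These objects can be written down concretely; here is a minimal encoding of a made-up two-state, two-action MDP (all numbers are arbitrary illustrations, not part of the formal argument):

```python
# A minimal encoding of the MDP objects defined above, using a
# made-up two-state, two-action example (numbers are arbitrary):

states  = ["s0", "s1"]
actions = ["a0", "a1"]

# P[(s, a)] maps each next state to its transition probability.
P = {
    ("s0", "a0"): {"s0": 0.9, "s1": 0.1},
    ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s0": 0.0, "s1": 1.0},
}

# R[(a, s, s_next)] is the reward R_a(s, s_next): here, reaching s1 pays 1.
R = {(a, s, s2): 1.0 if s2 == "s1" else 0.0
     for a in actions for s in states for s2 in states}

# A deterministic policy pi: S -> A, here just a lookup table.
pi = {"s0": "a1", "s1": "a1"}

# Sanity check: transition probabilities sum to 1 for every (s, a).
assert all(abs(sum(p.values()) - 1.0) < 1e-9 for p in P.values())
```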

The objective is to find a policy $$\pi$$ such that

$$$$\max_{\pi:S(n)\rightarrow a_i} \lim_{T\rightarrow \infty } E \left\{ \sum_{n=1}^T \beta^n R_{a_i}(S(n),S(n+1)) \right\}, \quad (1)$$$$ where $$\beta$$ is the discount factor and $$\beta<1$$.

Note that the optimization problem above has an infinite time horizon ($$T\rightarrow \infty$$), and the objective is to maximize the sum of *discounted* rewards (the reward $$R$$ is multiplied by $$\beta^n$$). This is usually called an MDP problem with an infinite horizon discounted reward criterion.

The problem is called discounted because $$\beta<1$$. If it were not a discounted problem ($$\beta=1$$) the sum would not converge: every policy that obtains on average a positive reward at each time instant would sum up to infinity. This would be an infinite horizon sum reward criterion, which is not a good optimization criterion.

Here is a toy example to show you what I mean:

Assume that there are only two possible actions $$a \in \{0,1\}$$ and that the reward function $$R$$ is equal to $$1$$ if $$a=1$$, and $$0$$ if $$a=0$$ (the reward does not depend on the state).

It is clear that the policy that gets more reward is to always take action $$a=1$$ and never action $$a=0$$. I'll call this policy $$\pi^*$$. I'll compare $$\pi^*$$ to another policy $$\pi'$$ that takes action $$a=1$$ with small probability $$\alpha \ll 1$$, and action $$a=0$$ otherwise.

In the infinite horizon discounted reward criteria equation (1) becomes $$\frac{1}{1-\beta}$$ (the sum of a geometric series) for policy $$\pi^*$$ while for policy $$\pi '$$ equation (1) becomes $$\frac{\alpha}{1-\beta}$$. Since $$\frac{1}{1-\beta} > \frac{\alpha}{1-\beta}$$, we say that $$\pi^*$$ is a better policy than $$\pi '$$. Actually $$\pi^*$$ is the optimal policy.
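The two values can be checked numerically; a short sketch with arbitrarily chosen $$\beta$$ and $$\alpha$$ (for convenience the sum starts at $$n=0$$; starting at $$n=1$$ as in equation (1) just multiplies both values by $$\beta$$ and does not change which policy wins):

```python
# Numeric check of the toy example (beta and alpha chosen arbitrarily).
beta, alpha = 0.9, 0.05
N = 10_000  # truncation horizon; terms beyond it are negligible

# pi* earns reward 1 every step; pi' earns alpha in expectation.
v_star  = sum(beta**n * 1.0   for n in range(N))
v_prime = sum(beta**n * alpha for n in range(N))

print(v_star,  1 / (1 - beta))       # both ~10
print(v_prime, alpha / (1 - beta))   # both ~0.5
```

The truncated sums match the closed-form geometric-series values, and $$\pi^*$$ beats $$\pi'$$ by a factor of $$1/\alpha$$.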

In the infinite horizon sum reward criterion ($$\beta=1$$) equation (1) does not converge for either of the policies (it sums up to infinity). So although policy $$\pi^*$$ achieves higher rewards than $$\pi'$$, both policies are equal according to this criterion. That is one reason why the infinite horizon sum reward criterion is not useful.

As I mentioned before, $$\beta<1$$ does the trick of making the sum in equation (1) converge.

Other optimality criteria

There are other optimality criteria that do not impose that $$\beta<1$$:

In the finite horizon criterion the objective is to maximize the discounted reward until the time horizon $$T$$: $$$$\max_{\pi:S(n)\rightarrow a_i} E \left\{ \sum_{n=1}^T \beta^n R_{a_i}(S(n),S(n+1)) \right\},$$$$

for $$\beta \leq 1$$ and $$T$$ finite.

In the infinite horizon average reward criterion the objective is $$$$\max_{\pi:S(n)\rightarrow a_i} \lim_{T\rightarrow \infty } E \left\{ \frac{1}{T} \sum_{n=1}^T R_{a_i}(S(n),S(n+1)) \right\}.$$$$
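Applied to the earlier toy example, the average reward criterion can rank the two policies even without discounting; a quick sketch (same made-up $$\alpha$$ as before, my own illustration):

```python
# Under the average reward criterion the running averages converge to
# 1 for pi* and to alpha for pi', so the criterion separates the two
# policies even though beta = 1.
import random

random.seed(0)
alpha, T = 0.05, 200_000

avg_star  = sum(1.0 for _ in range(T)) / T
avg_prime = sum(1.0 if random.random() < alpha else 0.0
                for _ in range(T)) / T
print(avg_star, avg_prime)   # ~1.0 and ~0.05
```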

End note

Depending on the optimality criterion, one would use a different algorithm to find the optimal policy. For instance, the optimal policies of finite horizon problems depend on both the state and the actual time instant. Most Reinforcement Learning algorithms (such as SARSA or Q-learning) converge to the optimal policy only for the discounted reward infinite horizon criterion (the same holds for Dynamic Programming algorithms). For the average reward criterion there is no algorithm that has been shown to converge to the optimal policy; however, one can use R-learning, which has good empirical performance albeit without strong theoretical convergence guarantees.
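To make concrete where the discount factor enters such algorithms, here is a minimal tabular Q-learning sketch on the earlier toy example (one state, two actions; my own illustration, not a full implementation):

```python
# Minimal tabular Q-learning on the toy example: one state, two
# actions, reward 1 for a=1 and 0 for a=0. The discount factor beta
# appears directly in the bootstrapped update target.
import random

random.seed(0)
beta, lr = 0.9, 0.1
Q = {0: 0.0, 1: 0.0}

for _ in range(20_000):
    a = random.choice([0, 1])          # explore uniformly
    r = 1.0 if a == 1 else 0.0
    # Q-learning update: Q(a) <- Q(a) + lr * (r + beta * max_a' Q(a') - Q(a))
    Q[a] += lr * (r + beta * max(Q.values()) - Q[a])

print(Q)  # Q[1] -> 1/(1-beta) = 10, Q[0] -> beta/(1-beta) = 9
```

The learned values match the discounted-criterion fixed point: always taking $$a=1$$ is worth $$1/(1-\beta)$$, and taking $$a=0$$ once then behaving optimally is worth $$\beta/(1-\beta)$$.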