Reinforcement Learning – When to Use a Low Discount Factor in Reinforcement Learning

hyperparameter · reinforcement learning

In reinforcement learning, we're trying to maximize long-term rewards weighted by a discount factor $\gamma$:
$$\sum_{t=0}^\infty \gamma^t r_t.$$

$\gamma$ is in the range $[0,1]$, where $\gamma=1$ means a reward in the future is as important as a reward on the next time step and $\gamma=0$ means that only the reward on the next time step is important. Formally, $\gamma$ is given as part of the problem, but in practice this isn't the case: we must choose how to build the states, actions, and rewards of the MDP out of real-world information.
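To make the effect of $\gamma$ concrete, here is a minimal sketch (the reward sequence is made up purely for illustration) that computes the discounted return of the same trajectory under a few values of $\gamma$. With $\gamma=0.1$ the large delayed reward contributes almost nothing; with $\gamma=0.99$ it dominates the sum.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical trajectory: small immediate rewards, one big reward 9 steps later.
rewards = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]

for gamma in [0.0, 0.1, 0.5, 0.9, 0.99, 1.0]:
    print(f"gamma = {gamma:4}: return = {discounted_return(rewards, gamma):.3f}")
```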

In my experience (which is far from comprehensive), the value of $\gamma$ used is typically high, such as 0.9 or 0.99 or 0.999. (Or simply 1.0 if we are restricted by a finite time horizon.) But this seems mostly arbitrary.

My question is: when might we use a low, but non-zero value for $\gamma$, such as 0.5 or 0.1?

I'm asking mostly out of curiosity; the question occurred to me and I thought I'd see whether any of you had seen something like this before.

The intuitive answer would be that $\gamma$ is low when the immediate rewards are much more important than long-term rewards, but that's strange. What environment could you be in where you still care about the future, but not that much? What kind of policy would you learn in an environment like that?

Best Answer

The discount factor doesn't really have a well-founded interpretation as far as I know. It seems to have been introduced primarily so that the problem is more mathematically or computationally well-behaved. People have interpreted it as a "life-span risk" factor (i.e., $\gamma$ is your chance of surviving each time step, so $1-\gamma$ is your chance of dying, and you should weight anticipated future reward accordingly). Personally I don't really buy it, because this could just as easily be built into the environment itself. Another interpretation is that it mimics human time preferences, but I don't really buy this either -- the point of reinforcement learning isn't really to mimic human behavior. You can see a bit more discussion of these points in the introduction here.

Anyway, if you're willing to accept either of these interpretations, you could say your agent is operating in a highly risky environment, where it has a 50% or 90% chance of dying each time step. Or maybe you're trying to learn really impulsive and short-term decision making. Or maybe your "reward" is denominated in some rapidly hyperinflating currency which loses 90% of its value every time step (but this goes into the interpretation of what "reward" is).
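The "risk of dying" reading can be checked numerically: discounting a constant reward of 1 by $\gamma$ gives $\sum_t \gamma^t = 1/(1-\gamma)$, which matches the expected undiscounted return of an environment that simply terminates with probability $1-\gamma$ after each step. Here is a minimal Monte Carlo sketch of that equivalence (the constant reward and episode count are arbitrary choices for illustration):

```python
import random

gamma = 0.5
reward_per_step = 1.0        # constant reward each step, purely for illustration
n_episodes = 200_000

# Analytic discounted return of an infinite stream of 1s: 1 / (1 - gamma)
analytic = reward_per_step / (1 - gamma)

# Undiscounted return when the episode terminates with probability (1 - gamma) each step
total = 0.0
for _ in range(n_episodes):
    while True:
        total += reward_per_step
        if random.random() > gamma:   # "die" with probability 1 - gamma
            break

print(f"analytic discounted return: {analytic:.3f}")
print(f"mean simulated return     : {total / n_episodes:.3f}")
```

The two numbers agree (up to Monte Carlo noise), which is why the life-span interpretation can always be folded into the environment's dynamics instead of the objective.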

You may also be interested in these two articles: https://arxiv.org/pdf/1910.02140.pdf and https://arxiv.org/pdf/1902.02893.pdf
