TL;DR.
Bounding the discount rate below 1 is a mathematical trick that makes an infinite sum finite. This helps prove the convergence of certain algorithms.
In practice, the discount factor can be used to model the decision maker's uncertainty about whether the world (e.g., environment / game / process) will still exist at the next decision instant.
For example:
If the decision maker is a robot, the discount factor could be the probability that the robot is switched off at the next time instant (the world ends, in the previous terminology). That is the reason why the robot is short-sighted and does not optimize the sum of rewards but the discounted sum of rewards.
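The "switch-off" interpretation can be checked numerically. The sketch below (my own illustration, not from the original answer) compares the analytic discounted return of a constant reward stream against a Monte Carlo simulation in which the robot is switched off with probability $1-\gamma$ before each step; the value of gamma and the per-step reward are assumed for illustration.

```python
import random

gamma = 0.9            # per-step survival probability (assumed)
reward_per_step = 1.0  # constant reward (assumed)

# Analytic discounted return: sum_{n=1}^inf gamma^n * 1 = gamma / (1 - gamma)
analytic = gamma / (1 - gamma)

# Monte Carlo: the robot earns 1 per step, but before each step it is
# switched off with probability (1 - gamma).
random.seed(0)
episodes = 200_000
total = 0.0
for _ in range(episodes):
    while random.random() < gamma:  # robot survives this step
        total += reward_per_step
estimate = total / episodes

print(analytic, estimate)  # the two values should be close
```

The expected undiscounted return of the terminating process matches the discounted return of the never-ending one, which is exactly the "life-span" reading of the discount factor.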
Discount factor smaller than 1 (In Detail)
In order to answer more precisely why the discount rate has to be smaller than one, I will first introduce Markov Decision Processes (MDPs).
Reinforcement learning techniques can be used to solve MDPs. An MDP provides a mathematical framework for modeling decision-making situations where outcomes are partly random and partly under the control of the decision maker. An MDP is defined via a state space $\mathcal{S}$, an action space $\mathcal{A}$, a function of transition probabilities between states (conditioned to the action taken by the decision maker), and a reward function.
In its basic setting, the decision maker takes an action and gets a reward from the environment, and the environment changes its state. Then the decision maker senses the state of the environment, takes an action, gets a reward, and so on. The state transitions are probabilistic and depend solely on the current state and the action taken by the decision maker. The reward obtained by the decision maker depends on the action taken and on both the original and the new state of the environment.
A reward $R_{a_i}(s_j,s_k)$ is obtained when taking action $a_i$ in state $s_j$ and the environment/system changes to state $s_k$. The decision maker follows a policy $\pi(\cdot):\mathcal{S}\rightarrow\mathcal{A}$ that for each state $s_j \in \mathcal{S}$ selects an action $a_i \in \mathcal{A}$; that is, the policy tells the decision maker which action to take in each state. The policy $\pi$ may be randomized as well, but that does not matter for now.
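As a concrete sketch of these ingredients, here is a tiny hand-made MDP in Python (the states, actions, transition probabilities, and rewards are all made-up assumptions for illustration):

```python
import random

states = ["s0", "s1"]
actions = ["a0", "a1"]

# Transition probabilities: P[(state, action)] -> list of (next_state, prob)
P = {
    ("s0", "a0"): [("s0", 0.8), ("s1", 0.2)],
    ("s0", "a1"): [("s1", 1.0)],
    ("s1", "a0"): [("s0", 1.0)],
    ("s1", "a1"): [("s1", 0.5), ("s0", 0.5)],
}

# Reward R_a(s_j, s_k): here, action "a1" always pays 1 and "a0" pays 0
def reward(action, state, next_state):
    return 1.0 if action == "a1" else 0.0

# A deterministic policy pi: S -> A
pi = {"s0": "a1", "s1": "a0"}

def step(state, action, rng):
    """Sample the next state and the associated reward."""
    next_states, probs = zip(*P[(state, action)])
    next_state = rng.choices(next_states, weights=probs)[0]
    return next_state, reward(action, state, next_state)

# A few interaction steps: sense the state, act, collect the reward, repeat
rng = random.Random(0)
s, total = "s0", 0.0
for _ in range(5):
    a = pi[s]
    s, r = step(s, a, rng)
    total += r
```

Under this particular policy the chain alternates between the two states, collecting a reward of 1 on every other step.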
The objective is to find a policy $\pi$ such that
\begin{equation} \label{eq:1}
\max_{\pi:S(n)\rightarrow a_i} \lim_{T\rightarrow \infty } E \left\{ \sum_{n=1}^T \beta^n R_{a_i}(S(n),S(n+1)) \right\}, \tag{1}
\end{equation}
where $\beta$ is the discount factor and $\beta<1$.
Note that the optimization problem above has an infinite time horizon ($T\rightarrow \infty$), and the objective is to maximize the sum of *discounted* rewards (the reward $R$ is multiplied by $\beta^n$).
This is usually called an MDP problem with an infinite horizon discounted reward criterion.
The problem is called discounted because $\beta<1$. If it were not discounted ($\beta=1$) the sum would not converge: any policy that obtains on average a positive reward at each time instant would sum up to infinity. That would be an infinite horizon sum reward criterion, which is not a useful optimization criterion.
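A quick numerical check of this (my own sketch, with an assumed constant reward of 1 per step): partial sums with β < 1 level off, while with β = 1 they just grow with the horizon.

```python
# Partial sums of sum_{n=1}^{T} beta^n * 1 for a constant reward of 1
def partial_sum(beta, T):
    return sum(beta ** n for n in range(1, T + 1))

print(partial_sum(0.9, 1000))  # converges: close to 0.9 / (1 - 0.9) = 9.0
print(partial_sum(1.0, 1000))  # diverges: equals T, so it grows without bound
```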
Here is a toy example to show you what I mean:
Assume that there are only two possible actions $a\in\{0,1\}$ and that the reward function $R$ equals $1$ if $a=1$, and $0$ if $a=0$ (the reward does not depend on the state).
It is clear that the policy that gets more reward is to always take action $a=1$ and never action $a=0$.
I'll call this policy $\pi^*$. I'll compare $\pi^*$ to another policy $\pi'$ that takes action $a=1$ with small probability $\alpha \ll 1$, and action $a=0$ otherwise.
In the infinite horizon discounted reward criterion, equation (1) becomes $\frac{\beta}{1-\beta}$ (the sum of a geometric series starting at $n=1$) for policy $\pi^*$, while for policy $\pi'$ it becomes $\frac{\alpha\beta}{1-\beta}$. Since $\frac{\beta}{1-\beta} > \frac{\alpha\beta}{1-\beta}$, we say that $\pi^*$ is a better policy than $\pi'$. Actually, $\pi^*$ is the optimal policy.
In the infinite horizon sum reward criterion ($\beta=1$), equation (1) does not converge for either policy (it sums up to infinity). So although policy $\pi^*$ achieves higher rewards than $\pi'$, both policies are equal according to this criterion. That is one reason why the infinite horizon sum reward criterion is not useful.
As I mentioned before, $\beta<1$ does the trick of making the sum in equation (1) converge.
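The comparison can also be verified numerically. A small sketch (β, α, and the truncation horizon are assumed values; the expected per-step reward is 1 under $\pi^*$ and $\alpha$ under $\pi'$):

```python
beta, alpha, T = 0.95, 0.01, 2000

# Truncated expected discounted returns; beta^T is negligible at T = 2000
v_star  = sum(beta ** n * 1.0   for n in range(1, T + 1))  # always a = 1
v_prime = sum(beta ** n * alpha for n in range(1, T + 1))  # a = 1 w.p. alpha

print(v_star > v_prime)  # pi* strictly dominates pi' for any alpha < 1
```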
Other optimality criteria
There are other optimality criteria that do not require $\beta<1$:
In the finite horizon criterion, the objective is to maximize the discounted reward up to the time horizon $T$:
\begin{equation} \label{eq:2}
\max_{\pi:S(n)\rightarrow a_i} E \left\{ \sum_{n=1}^T \beta^n R_{a_i}(S(n),S(n+1)) \right\},
\end{equation}
for $\beta \leq 1$ and $T$ finite.
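For illustration, the finite horizon problem can be solved by backward induction (dynamic programming); note that the resulting policy is indexed by the time instant as well as the state. The tiny MDP below is my own made-up example, not from the answer:

```python
# Backward induction over a horizon T: set V_{T+1} = 0, then work backwards
states, actions = [0, 1], [0, 1]
beta, T = 1.0, 3

# P[s][a] -> list of (next_state, prob); action 1 pays 1, action 0 pays 0
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
def R(s, a, s2):
    return 1.0 if a == 1 else 0.0

V = {s: 0.0 for s in states}  # terminal value V_{T+1} = 0
policy = {}                   # policy[(t, s)] -> optimal action at time t
for t in range(T, 0, -1):
    newV = {}
    for s in states:
        q = {a: sum(p * (R(s, a, s2) + beta * V[s2]) for s2, p in P[s][a])
             for a in actions}
        policy[(t, s)] = max(q, key=q.get)
        newV[s] = q[policy[(t, s)]]
    V = newV

print(V, policy)  # in this example a = 1 is optimal at every (t, s)
```

In this particular example the optimal action happens to be the same at every time step, but in general the argmax can differ across $t$, which is why the policy is stored per $(t, s)$ pair.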
In the infinite horizon average reward criterion, the objective is
\begin{equation}
\max_{\pi:S(n)\rightarrow a_i} \lim_{T\rightarrow \infty } E \left\{ \frac{1}{T} \sum_{n=1}^T R_{a_i}(S(n),S(n+1)) \right\}.
\end{equation}
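Applied to the earlier toy example, the average reward criterion keeps the objective finite without any discounting, and it still distinguishes the two policies (α is assumed as before):

```python
alpha, T = 0.01, 100_000

# Per-step average of the expected reward over a long horizon
avg_star  = sum(1.0   for _ in range(T)) / T  # pi*: always a = 1
avg_prime = sum(alpha for _ in range(T)) / T  # pi': a = 1 w.p. alpha (expected)

print(avg_star, avg_prime)  # 1.0 vs about 0.01: finite, and pi* still wins
```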
End note
Depending on the optimality criterion, one would use a different algorithm to find the optimal policy. For instance, the optimal policies of finite horizon problems depend on both the state and the actual time instant. Most reinforcement learning algorithms (such as SARSA or Q-learning) converge to the optimal policy only under the infinite horizon discounted reward criterion (the same happens for dynamic programming algorithms). For the average reward criterion there is no algorithm that has been shown to converge to the optimal policy; however, one can use R-learning, which has good empirical performance albeit no good theoretical convergence guarantees.
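As a concrete sketch of one such algorithm, here is minimal tabular Q-learning on the toy example from above (a single state, reward 1 for $a=1$; the learning rate, exploration rate, γ, and step count are assumed values, not from the original answer):

```python
import random

gamma, lr, eps = 0.9, 0.1, 0.1
Q = {0: 0.0, 1: 0.0}  # single state, so Q is indexed by action only
rng = random.Random(0)

for _ in range(10_000):
    # epsilon-greedy action selection
    a = rng.choice([0, 1]) if rng.random() < eps else max(Q, key=Q.get)
    r = 1.0 if a == 1 else 0.0
    # Q-learning update; the "next state" is the same single state
    Q[a] += lr * (r + gamma * max(Q.values()) - Q[a])

# Q[1] approaches 1 / (1 - gamma) = 10 and Q[0] approaches gamma * 10 = 9
print(Q)
```

The learned values are the discounted returns of taking each action once and then acting greedily forever, which only makes sense because γ < 1 keeps them finite.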
The discount factor doesn't really have a well-founded interpretation as far as I know. It seems to have been introduced primarily so that the problem is more mathematically or computationally well-behaved. People have interpreted it as a "life-span risk" factor (i.e., $\gamma$ is your chance of dying each time step, so you should weigh anticipated future reward accordingly). Personally I don't really buy it, because this could just as easily be built into the environment itself. Another interpretation is that it mimics human time preferences, but I don't really buy this either -- the point of reinforcement learning isn't really to mimic human behavior. You can see a bit more discussion on these points in the introduction here.
Anyway, if you're willing to accept either of these interpretations, you could say your agent is operating in a highly risky environment, where it has a 50% or 90% chance of dying each time step. Or maybe you're trying to learn really impulsive and short-term decision making. Or maybe your "reward" is denominated in some rapidly hyperinflating currency which loses 90% of its value every time step (but this goes into the interpretation of what "reward" is).
You may also be interested in these two articles: https://arxiv.org/pdf/1910.02140.pdf and https://arxiv.org/pdf/1902.02893.pdf
Best Answer
Kris De Asis wrote: The discount factor affects how much weight is given to future rewards in the value function. A discount factor γ=0 will result in state/action values representing the immediate reward, while a higher discount factor, e.g. γ=0.9, will result in values representing the cumulative discounted future reward an agent expects to receive (behaving under a given policy). Whether the discount factor affects convergence depends on whether the task is continuing or episodic: in a continuing task, γ must lie in [0, 1), whereas in an episodic one it can lie in [0, 1].
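The effect of γ on this weighting can be seen directly in a discounted return. A small sketch (the reward sequence is an arbitrary made-up example):

```python
rewards = [1.0, 2.0, 3.0, 4.0]

def discounted_return(rewards, gamma):
    # G = sum_t gamma^t * r_t
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 0.9))  # 8.146: future rewards weigh in too
```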