TL;DR.
The fact that the discount factor is bounded to be smaller than 1 is a mathematical trick to make an infinite sum finite. This helps in proving the convergence of certain algorithms.
In practice, the discount factor can be used to model the fact that the decision maker is uncertain about whether, at the next decision instant, the world (e.g., environment / game / process) is going to end.
For example:
If the decision maker is a robot, the discount factor could be the probability that the robot is not switched off at the next time instant (equivalently, one minus the probability that the world ends, in the previous terminology). That is the reason why the robot is short sighted and does not optimize the plain sum of rewards but the discounted sum of rewards.
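To make this reading concrete, here is a minimal Python sketch (the numbers are made up): it checks by simulation that collecting undiscounted reward until a random shutdown, with survival probability $\beta$ per step, has the same expectation as the discounted return of the never-ending process (up to Monte Carlo noise).

```python
import random

beta = 0.9            # discount factor, read as the per-step survival probability
reward_per_step = 1.0
num_episodes = 100_000
horizon_cap = 1_000   # truncation just to keep the simulation finite

# Monte Carlo: collect undiscounted reward until the world "ends".
total = 0.0
for _ in range(num_episodes):
    for _ in range(horizon_cap):
        if random.random() > beta:   # with probability 1 - beta the world ends here
            break
        total += reward_per_step
total /= num_episodes

# Discounted return of the never-ending process (geometric series starting at n = 1).
discounted = beta * reward_per_step / (1 - beta)

print(f"simulated return until shutdown ≈ {total:.2f}")
print(f"discounted infinite-horizon return = {discounted:.2f}")   # both ≈ 9.0
```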
Discount factor smaller than 1 (In Detail)
In order to answer more precisely why the discount factor has to be smaller than one, I will first introduce Markov Decision Processes (MDPs).
Reinforcement learning techniques can be used to solve MDPs. An MDP provides a mathematical framework for modeling decision-making situations where outcomes are partly random and partly under the control of the decision maker. An MDP is defined via a state space $\mathcal{S}$, an action space $\mathcal{A}$, a function of transition probabilities between states (conditioned on the action taken by the decision maker), and a reward function.
In its basic setting, the decision maker takes an action and gets a reward from the environment, and the environment changes its state. Then the decision maker senses the new state of the environment, takes an action, gets a reward, and so on. The state transitions are probabilistic and depend solely on the current state and the action taken by the decision maker. The reward obtained by the decision maker depends on the action taken and on both the original and the new state of the environment.
A reward $R_{a_i}(s_j,s_k)$ is obtained when taking action $a_i$ in state $s_j$ and the environment/system changes to state $s_k$. The decision maker follows a policy $\pi(\cdot):\mathcal{S}\rightarrow\mathcal{A}$ that assigns to each state $s_j \in \mathcal{S}$ an action $a_i \in \mathcal{A}$; that is, the policy tells the decision maker which action to take in each state. The policy $\pi$ may be randomized as well, but that does not matter for now.
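Before stating the optimization problem, here is a minimal sketch of how such an MDP can be written down as plain data (Python dictionaries; the states, actions, probabilities and rewards below are made-up illustrations):

```python
# A tiny hypothetical MDP: 2 states, 2 actions, written as plain dictionaries.
states = ["s0", "s1"]
actions = ["a0", "a1"]

# P[s][a] is a dict of next-state probabilities (they sum to 1 for each (s, a)).
P = {
    "s0": {"a0": {"s0": 0.9, "s1": 0.1}, "a1": {"s0": 0.2, "s1": 0.8}},
    "s1": {"a0": {"s0": 0.5, "s1": 0.5}, "a1": {"s0": 0.0, "s1": 1.0}},
}

# R[a][(s, s_next)]: reward for taking action a in s and landing in s_next.
R = {
    "a0": {("s0", "s0"): 0.0, ("s0", "s1"): 1.0, ("s1", "s0"): 0.0, ("s1", "s1"): 0.0},
    "a1": {("s0", "s0"): 0.0, ("s0", "s1"): 2.0, ("s1", "s0"): 0.0, ("s1", "s1"): 0.5},
}

# A deterministic policy: one action per state.
pi = {"s0": "a1", "s1": "a0"}

# Expected one-step reward of following pi from state s.
def expected_reward(s, policy):
    a = policy[s]
    return sum(p * R[a][(s, s_next)] for s_next, p in P[s][a].items())

print(expected_reward("s0", pi))  # 0.2*0.0 + 0.8*2.0 = 1.6
```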
The objective is to find a policy $\pi$ such that
\begin{equation} \label{eq:1}
\max_{\pi:S(n)\rightarrow a_i} \lim_{T\rightarrow \infty } E \left\{ \sum_{n=1}^T \beta^n R_{a_i}(S(n),S(n+1)) \right\}, \tag{1}
\end{equation}
where $\beta$ is the discount factor and $\beta<1$.
Note that the optimization problem above has an infinite time horizon ($T\rightarrow \infty$), and the objective is to maximize the sum of $discounted$ rewards (the reward $R$ is multiplied by $\beta^n$).
This is usually called an MDP problem with an infinite horizon discounted reward criterion.
The problem is called discounted because $\beta<1$. If it were not a discounted problem ($\beta=1$), the sum would not converge: every policy that obtains on average a positive reward at each time instant would sum up to infinity. That would be an infinite horizon sum reward criterion, which is not a good optimization criterion.
Here is a toy example to show you what I mean:
Assume that there are only two possible actions, $a \in \{0,1\}$, and that the reward function $R$ is equal to $1$ if $a=1$ and $0$ if $a=0$ (the reward does not depend on the state).
It is clear that the policy that gets more reward is to always take action $a=1$ and never action $a=0$.
I'll call this policy $\pi^*$. I'll compare $\pi^*$ to another policy $\pi'$ that takes action $a=1$ with small probability $\alpha \ll 1$, and action $a=0$ otherwise.
Under the infinite horizon discounted reward criterion, equation (1) becomes $\frac{\beta}{1-\beta}$ (the sum of a geometric series) for policy $\pi^*$, while for policy $\pi'$ it becomes $\frac{\alpha\beta}{1-\beta}$. Since $\frac{\beta}{1-\beta} > \frac{\alpha\beta}{1-\beta}$, we say that $\pi^*$ is a better policy than $\pi'$. Actually, $\pi^*$ is the optimal policy.
Under the infinite horizon sum reward criterion ($\beta=1$), equation (1) does not converge for either of the policies (it sums up to infinity). So even though policy $\pi^*$ accumulates reward faster than $\pi'$, the two policies are equivalent under this criterion. That is one reason why the infinite horizon sum reward criterion is not useful.
As I mentioned before, $\beta<1$ is what makes the sum in equation (1) converge.
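A quick numerical sanity check of the toy example (Python, with arbitrary $\beta$ and $\alpha$): truncating the series at a large $T$ shows the discounted returns settling at $\beta/(1-\beta)$ and $\alpha\beta/(1-\beta)$, while the undiscounted sums just keep growing with $T$.

```python
beta, alpha = 0.95, 0.01   # arbitrary values for illustration
T = 10_000                 # truncation point for the "infinite" sums

# Expected per-step reward: 1 under pi* (always a=1), alpha under pi'.
discounted_star  = sum(beta**n * 1.0   for n in range(1, T + 1))
discounted_prime = sum(beta**n * alpha for n in range(1, T + 1))

print(discounted_star,  beta / (1 - beta))            # both ≈ 19.0
print(discounted_prime, alpha * beta / (1 - beta))    # both ≈ 0.19
print(sum(1.0 for n in range(1, T + 1)),              # undiscounted sums: these just
      sum(alpha for n in range(1, T + 1)))            # grow without bound as T grows
```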
Other optimality criteria
There are other optimality criteria that do not impose that $\beta<1$:
In the finite horizon criterion, the objective is to maximize the discounted reward up to the time horizon $T$:
\begin{equation} \label{eq:2}
\max_{\pi:S(n)\rightarrow a_i} E \left\{ \sum_{n=1}^T \beta^n R_{a_i}(S(n),S(n+1)) \right\},
\end{equation}
for $\beta \leq 1$ and $T$ finite.
In the infinite horizon average reward criterion, the objective is
\begin{equation}
\max_{\pi:S(n)\rightarrow a_i} \lim_{T\rightarrow \infty } E \left\{ \frac{1}{T} \sum_{n=1}^T R_{a_i}(S(n),S(n+1)) \right\}.
\end{equation}
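For intuition, here is a quick check of this criterion on the toy example from above (arbitrary $\alpha$, for illustration): the average reward stays finite even without discounting and still ranks $\pi^*$ above $\pi'$.

```python
import random

alpha, T = 0.01, 1_000_000   # arbitrary values for illustration
random.seed(0)

# pi* always takes a=1 (reward 1); pi' takes a=1 only with probability alpha.
avg_star  = sum(1.0 for _ in range(T)) / T
avg_prime = sum(1.0 if random.random() < alpha else 0.0 for _ in range(T)) / T

print(avg_star, avg_prime)   # ≈ 1.0 and ≈ alpha: this criterion also prefers pi*
```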
End note
Depending on the optimality criterion, one would use a different algorithm to find the optimal policy. For instance, the optimal policies of finite horizon problems depend on both the state and the actual time instant. Most reinforcement learning algorithms (such as SARSA or Q-learning) converge to the optimal policy only under the discounted reward infinite horizon criterion (the same happens for the dynamic programming algorithms). For the average reward criterion there is no algorithm that has been shown to converge to the optimal policy; however, one can use R-learning, which has good performance albeit not good theoretical convergence guarantees.
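To illustrate the first remark, here is a minimal backward-induction (dynamic programming) sketch on a made-up two-state, finite-horizon MDP; every name and number below is illustrative. With several steps to go, the optimal action in s0 is to "move" towards the productive state, but at the very last step it switches to "grab", so the optimal policy depends on the time index and not only on the state.

```python
# A made-up finite-horizon MDP: in s0, "grab" pays 1 and stays in s0, while
# "move" pays 0 but reaches s1, where "harvest" pays 3 per step.
P = {
    "s0": {"grab": {"s0": 1.0}, "move": {"s1": 1.0}},
    "s1": {"harvest": {"s1": 1.0}},
}
R = {
    ("s0", "grab"): 1.0,
    ("s0", "move"): 0.0,
    ("s1", "harvest"): 3.0,
}

def backward_induction(T, beta=1.0):
    """Return the time-indexed values V[t][s] and policy pi[t][s] for horizon T."""
    V = {T: {s: 0.0 for s in P}}          # no reward collected after the horizon
    pi = {}
    for t in range(T - 1, -1, -1):        # sweep backwards in time
        V[t], pi[t] = {}, {}
        for s in P:
            best_a, best_q = None, float("-inf")
            for a, next_probs in P[s].items():
                q = R[(s, a)] + beta * sum(p * V[t + 1][s2] for s2, p in next_probs.items())
                if q > best_q:
                    best_a, best_q = a, q
            V[t][s], pi[t][s] = best_q, best_a
    return V, pi

V, pi = backward_induction(T=4)
print([pi[t]["s0"] for t in range(4)])    # ['move', 'move', 'move', 'grab']
```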
They mostly look the same, except that in SARSA we take the actual action, whereas in Q-learning we take the action with the highest reward.
Actually, in both you "take" the actual single generated action $a_{t+1}$ next. In Q-learning, you update the estimate from the maximum estimate over possible next actions, regardless of which action you took. In SARSA, you update the estimate based on the same action that you take.
This is probably what you meant by "take" in the question, but in the literature, taking an action means that it becomes the value of e.g. $a_{t}$, and influences $r_{t+1}$, $s_{t+1}$.
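For reference, here is how that difference looks in the tabular update rules (a minimal sketch; the variable names are mine): both updates are applied after observing $(s, a, r, s_{next})$, and the only change is which next-action value enters the target.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy target: the greedy value of the next state, whatever is taken next."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy target: the value of the next action actually chosen by the behaviour policy."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Tiny usage example with a table that defaults to 0.
Q = defaultdict(float)
actions = [0, 1]
q_learning_update(Q, s="s0", a=1, r=1.0, s_next="s1", actions=actions)
sarsa_update(Q, s="s0", a=1, r=1.0, s_next="s1", a_next=0)
```

In both cases the agent still executes whatever action its behaviour policy (e.g. $\epsilon$-greedy) picks next; Q-learning simply does not use that choice in its target.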
Are there any theoretical or practical settings in which one should prefer one over the other?
Q-learning has the following advantages and disadvantages compared to SARSA:
Q-learning directly learns the optimal policy, whilst SARSA learns a near-optimal policy whilst exploring. If you want to learn an optimal policy using SARSA, then you will need to decide on a strategy to decay $\epsilon$ in $\epsilon$-greedy action choice, which may become a fiddly hyperparameter to tune (one possible decay schedule is sketched below, after these points).
Q-learning (and off-policy learning in general) has higher per-sample variance than SARSA, and may suffer from problems converging as a result. This turns up as a problem when training neural networks via Q-learning.
SARSA will approach convergence allowing for possible penalties from exploratory moves, whilst Q-learning will ignore them. That makes SARSA more conservative - if there is risk of a large negative reward close to the optimal path, Q-learning will tend to trigger that reward whilst exploring, whilst SARSA will tend to avoid a dangerous optimal path and only slowly learn to use it when the exploration parameters are reduced. The classic toy problem that demonstrates this effect is called cliff walking.
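For the $\epsilon$-decay point above, one possible (purely illustrative) schedule and the corresponding $\epsilon$-greedy choice look like this; the exact form and constants are hyperparameters, not something prescribed by SARSA itself.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one.
    Q is assumed to be a mapping (state, action) -> value, e.g. a defaultdict(float)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def epsilon_schedule(episode, eps_start=1.0, eps_min=0.01, decay=0.001):
    """Exponential-style decay towards eps_min; the constants are hyperparameters to tune."""
    return eps_min + (eps_start - eps_min) * (1.0 - decay) ** episode
```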
In practice the last point can make a big difference if mistakes are costly - e.g. you are training a robot not in simulation, but in the real world. You may prefer a more conservative learning algorithm that avoids high risk, when real time and money are at stake should the robot be damaged.
If your goal is to train an optimal agent in simulation, or in a low-cost and fast-iterating environment, then Q-learning is a good choice, due to the first point (learning optimal policy directly). If your agent learns online, and you care about rewards gained whilst learning, then SARSA may be a better choice.
Best Answer
The discount factor doesn't really have a well-founded interpretation as far as I know. It seems to have been introduced primarily so that the problem is more mathematically or computationally well-behaved. People have interpreted it as a "life-span risk" factor (i.e. $1-\gamma$ is your chance of dying at each time-step, so you should weigh anticipated future reward accordingly). Personally I don't really buy it, because this could just as easily be built into the environment itself. Another interpretation is that it mimics human time preferences, but I don't really buy this either -- the point of reinforcement learning isn't really to mimic human behavior. You can see a bit more discussion on these points in the introduction here.
Anyway, if you're willing to accept either of these interpretations, you could say your agent is operating in a highly risky environment, where it has a 50 or 90% chance of dying each time step. Or maybe you're trying to learn really impulsive and short-term decision making. Or maybe your "reward" is denominated in some rapidly hyperinflating currency which loses 90% of its value every time step (but this goes into the interpretation of what "reward" is).
You may also be interested in these two articles: https://arxiv.org/pdf/1910.02140.pdf and https://arxiv.org/pdf/1902.02893.pdf