This has nothing to do with frequentism. When the policy $\pi$ creates a distribution over actions, it is called a stochastic policy.
Originally, policies were not stochastic since they were defined as mapping to the highest-value action. The actual policy that is followed in, say, an $\epsilon$-greedy approach is to disobey that deterministic policy and act randomly with probability $\epsilon$.
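In standard notation, a stochastic policy is a conditional distribution over actions given the state, and $\epsilon$-greedy is one simple way to construct such a distribution from action-value estimates $Q(s,a)$:

$$\pi(a \mid s) = \Pr(A_t = a \mid S_t = s), \qquad \pi_{\epsilon}(a \mid s) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \frac{\epsilon}{|\mathcal{A}(s)|} & \text{otherwise} \end{cases}$$

where $|\mathcal{A}(s)|$ is the number of available actions and the exploration probability $\epsilon$ is spread uniformly across all of them (including the greedy action).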
Reinforcement learning is formulated as a problem with states, actions, and rewards, with transitions between states affected by the current state, chosen action and the environment. That is part of its definition (formulated as a Markov Decision Process), so generally you won't find stateless variants of it that are still called reinforcement learning.
However, there are related stateless problems. Multi-armed bandits have just actions and rewards. Solutions to them learn the expected reward of each action, and the main optimisation problems are being sure you have identified the best action, and maximising the total reward you can accumulate whilst still testing which action is best. Your button-pushing example looks a lot like a multi-armed bandit problem. Another common example is online advert selection for anonymous visitors to a site: although there is often plenty of data available, there is also a huge amount that is hidden, and a practical approach is to treat the probability of a click-through as depending only on the choice of content, which is then the site's action.
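As a minimal sketch of solving such a problem (the arm count and reward probabilities here are made up for illustration), an $\epsilon$-greedy bandit solver for something like your button example could look like this:

```python
import random

def run_bandit(reward_probs, epsilon=0.1, steps=10000):
    """Epsilon-greedy solver for a Bernoulli multi-armed bandit.

    reward_probs: true (hidden) probability of reward for each arm.
    Returns the estimated value of each arm after `steps` pulls.
    """
    n_arms = len(reward_probs)
    counts = [0] * n_arms       # times each arm was pulled
    estimates = [0.0] * n_arms  # incremental mean reward per arm

    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit best estimate
        if random.random() < epsilon:
            arm = random.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])

        reward = 1.0 if random.random() < reward_probs[arm] else 0.0

        # Incremental mean update: Q <- Q + (r - Q) / n
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates

# Three "buttons" with hidden payout rates; the solver should
# converge on the last arm as the best choice.
print(run_bandit([0.2, 0.5, 0.7]))
```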
There is a "stateful" variant of multi-armed bandits called contextual bandits - when there is some kind of signal that can be associated with the correct action, but actions taken have no effect on what the next state will be. Contextual bandits have states, actions and rewards, but no transition rules, they can be treated as a set of entirely separate events.
A contextual bandit with added transition rules between states, but no influence from the selected action, is essentially a sub-class of the reinforcement learning problem, and you can use most of the same analysis to predict long term reward and learn optimal behaviour.
And just for completeness, there are Markov Reward Processes without agent interaction that have states, and rewards, but no actions. It is possible to use reinforcement learning algorithms on these to predict long term reward and/or the expected long term value of being in a specific state.
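As a sketch in the usual notation (with discount factor $\gamma$ and rewards $R_{t+1}, R_{t+2}, \dots$), the quantity being predicted is the state value

$$v(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right]$$

which is the same definition as in the full RL problem, just with no actions to choose between.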
The rock-paper-scissors game does not fit neatly into any of the above problems because there are two agents (although you could analyse it as a multi-armed bandit if the opponent's policy was fixed to always play with specific unchanging probabilities, or as a contextual bandit if there were multiple such opponents or a really easy "tell" to their style of play that never changed). Typically a game of rock-paper-scissors would be analysed using game theory; it has the interesting feature that the Nash equilibrium is achieved by using a stochastic policy with equal $\frac{1}{3}$ probability of each choice.
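You can check that equilibrium directly: if the opponent plays each move with probability $\frac{1}{3}$, then whatever you choose, you win, lose and draw with equal probability, so with payoffs $+1$, $-1$ and $0$ your expected payoff is

$$\mathbb{E}[\text{payoff}] = \tfrac{1}{3}(+1) + \tfrac{1}{3}(-1) + \tfrac{1}{3}(0) = 0$$

regardless of your own policy, so no deviation can improve on playing uniformly at random yourself.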
If you wrote a rock-paper-scissors agent to play against a human opponent, you might actually formulate it as a reinforcement learning problem, taking the last N plays as the state, because that could learn to take advantage of human players' poor judgement of randomness.
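A minimal sketch of such an agent, assuming a hypothetical biased opponent (for simplicity it uses one-step, bandit-style value updates keyed on the opponent's last N moves, rather than full bootstrapped Q-learning):

```python
import random
from collections import defaultdict

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, theirs):
    if mine == theirs:
        return 0.0
    return 1.0 if BEATS[mine] == theirs else -1.0

class HistoryAgent:
    """Treats the opponent's last N moves as the state and learns
    an action-value estimate for each (state, action) pair."""
    def __init__(self, n=2, epsilon=0.1, alpha=0.1):
        self.n, self.epsilon, self.alpha = n, epsilon, alpha
        self.q = defaultdict(float)   # (state, action) -> value estimate
        self.history = ()

    def act(self):
        if len(self.history) < self.n or random.random() < self.epsilon:
            return random.choice(MOVES)
        return max(MOVES, key=lambda a: self.q[(self.history, a)])

    def learn(self, my_move, their_move):
        if len(self.history) == self.n:
            key = (self.history, my_move)
            self.q[key] += self.alpha * (payoff(my_move, their_move) - self.q[key])
        self.history = (self.history + (their_move,))[-self.n:]

# A "human-like" opponent that repeats its previous move too often;
# the agent should learn to exploit that bias for a positive mean payoff.
agent, last = HistoryAgent(), random.choice(MOVES)
score = 0.0
for _ in range(20000):
    mine = agent.act()
    theirs = last if random.random() < 0.6 else random.choice(MOVES)
    score += payoff(mine, theirs)
    agent.learn(mine, theirs)
    last = theirs
print("mean payoff:", score / 20000)
```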
Best Answer
Yes. If such a dataset is available, this approach will learn an approximation of the policy function from it.
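As a sketch of what that looks like (this is "behavioural cloning"; the features and action labels below are made up for illustration), any standard classifier will do:

```python
# Learning a policy by supervised learning, assuming you already
# have states labelled with the correct action in each one.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset: each row is a state's feature vector,
# each label is the expert's chosen action in that state.
states = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.9, 0.1]])
actions = np.array(["left", "right", "left", "right"])

policy = DecisionTreeClassifier().fit(states, actions)

# The fitted model is now an approximate policy function pi(s) -> a.
print(policy.predict([[0.8, 0.2]]))
```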
Reinforcement learning (RL) is for when you do not have such a complete and finished dataset, with the answers of how the agent should act in every circumstance. Instead you typically have the definition of an environment, such as the rules of a game, or the controls and sensor inputs from a robot, and the problem is to figure out immediate behaviours that lead to a desired goal. The best action to take in any given situation that leads to a longer-term goal is often not obvious.
RL provides a mechanism to learn from trial and error.
No, unless you already have the dataset that you suggest.
However, knowledge of supervised learning is applied within RL frameworks. Most "Deep RL", which combines RL with neural networks, can be thought of as an outer RL algorithm that generates training data (the observed outcomes of behaviour chosen to explore whilst improving performance towards an optimal policy), combined with an inner supervised learning mechanism (generalising from that observation data to help improve performance in as-yet unseen situations).
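A sketch of that outer/inner structure, in the style of DQN. To keep it self-contained it uses a small Q-table in place of a deep network and a hypothetical five-state corridor environment, but the split is the same: the outer loop generates experience by acting, and the inner update is a regression step toward a bootstrapped target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny environment: states 0..4 on a line; action 0 moves
# left, action 1 moves right. Reaching state 4 gives reward 1 and ends
# the episode.
def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(5 - 1, s + 1)
    return s2, float(s2 == 4), s2 == 4

# "Inner" learner: a Q-table standing in for the neural network.
Q = np.zeros((5, 2))
gamma, alpha, epsilon = 0.9, 0.1, 0.3

for episode in range(500):
    s, done = 0, False
    while not done:
        # OUTER (RL) part: generate experience by acting epsilon-greedily
        a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # Bootstrapped regression target, as in DQN
        target = r if done else r + gamma * np.max(Q[s2])
        # INNER (supervised) part: one gradient-style step toward the target
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2

# The learned greedy policy should move right in states 0..3.
print(np.argmax(Q[:4], axis=1))
```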
In some simpler problems you could use RL techniques or searches to generate a whole dataset for supervised learning, as separate stages of a pipeline. For example, you could perform a tree search from every state in tic tac toe to determine the optimal actions, save the results to a dataset, and learn a policy function from it. It may help you understand the role of RL to think of an approach like that as one extreme of a continuum: at one end, the RL and supervised learning parts are entirely separate stages; at the other end, RL learns directly online from every observation with little or no supervised learning required. Deep RL fits somewhere in the middle.
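A minimal sketch of the search stage of that pipeline: exhaustive minimax labels every reachable tic tac toe position with an optimal move, and the resulting dictionary is exactly the kind of labelled dataset you could then fit a supervised policy to.

```python
from functools import lru_cache

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for i, j, k in LINES:
        if board[i] != " " and board[i] == board[j] == board[k]:
            return board[i]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Return (score from X's perspective, best move) for the position."""
    w = winner(board)
    if w:
        return (1 if w == "X" else -1), None
    moves = [i for i, c in enumerate(board) if c == " "]
    if not moves:
        return 0, None  # draw
    results = []
    for m in moves:
        nxt = board[:m] + player + board[m+1:]
        score, _ = minimax(nxt, "O" if player == "X" else "X")
        results.append((score, m))
    # X maximises the score, O minimises it
    return max(results) if player == "X" else min(results)

def build_dataset():
    """Enumerate reachable positions and record an optimal move for each."""
    dataset = {}
    def visit(board, player):
        if winner(board) or " " not in board or board in dataset:
            return
        _, move = minimax(board, player)
        dataset[board] = move
        for i, c in enumerate(board):
            if c == " ":
                visit(board[:i] + player + board[i+1:],
                      "O" if player == "X" else "X")
    visit(" " * 9, "X")
    return dataset

data = build_dataset()
print(len(data), "positions labelled with an optimal move")
print("an optimal opening move for X:", data[" " * 9])
```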
If you have a complete and accurate dataset that describes the optimal solution to a control problem, then using RL may be inefficient. However, in practice, that's a big if. To take a modern example of successful application of RL, where would you get this dataset for the game of Go?
In very many real-world problems with action choices, we do not have access to instructions on how to choose optimally. This is where RL fits into the broader machine learning toolkit: it provides a general mechanism for finding optimal solutions to control problems via trial and error.
There may be alternatives to RL in those cases.
However, RL has been demonstrated as a strong contender in many areas where it delivers state-of-the-art results, beating other approaches. A typical example would be learning to play Atari computer games.