There are no general rules that cover deploying all possible RL agents to production, as there is a huge variety of RL code and approaches.
However, in your case, you have identified key issues that help make a decision:
- You don't require the agent to continue learning in production. You consider it trained and ready to use to make decisions.
- The training environment was a simulation, and you have a real environment to deploy to.
In addition, I can identify a further issue that may have an impact, based on your previous questions:
- Your custom environment replicates the challenge of the original problem, with a guess at the behaviour of stresses on the system that the agent is supposed to keep within limits. Therefore, the distribution of state transitions used to train the agent may not match the actual distribution of state transitions in production.
Are my custom environment and agent classes still required when making predictions about the action using the saved model?
If you have finished training, then none of the training code is required. That is not quite the same as saying that you don't want the agent class, because it will depend on how that was written. It may be more convenient to use it in production too, but it is not necessary. A cut-down version of the agent could work, or an entirely new piece of code. As you will see below, if you have a network that can estimate $q(s,a)$, there is very little extra code you need to make it select actions.
The custom environment will not be required.
If not, what is the best way to move RL models into production?
There is no "best way", as things will depend critically on how important success and failure scenarios are to you.
Your trained model estimates the action-value function $\hat{q}(s,a,\theta)$, with the stored weights in the .h5
file being $\theta$. You will need to load a copy of the model into a NN in production. Once this is done, you will have the ability to predict action values, and you can use that to drive a simple greedy deterministic policy.
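For example, if the model was saved with Keras, loading it in production might be as simple as the following sketch (the filename agent.h5 is a placeholder for your saved file):

```python
from tensorflow.keras.models import load_model

# Restore the trained Q-network; "agent.h5" stands in for your saved .h5 file.
model = load_model("agent.h5")
```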
The code for the policy should implement the greedy action selector:
$$\pi(s) = \text{argmax}_a \hat{q}(s,a,\theta)$$
and this might be no more complex than the following Python, assuming you have Numpy and a Keras model loaded:
action = np.argmax(model.predict(np.array([state]))[0])
You will also need to do the following:
- Have a way to read state values and input them to the code running the agent. This might be automated sensors, or could just be someone taking a reading off a dial or scale and typing the results into a prompt.
- Have a way to actually take the action in the environment. Again, this might be automated, or have humans acting as proxies. It doesn't matter, as long as the action is taken according to the policy.
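Putting those pieces together, a minimal production loop might look like the sketch below. The filename agent.h5 and the helpers read_state() and take_action() are hypothetical stand-ins for your saved model and for however you read sensors and actuate decisions:

```python
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("agent.h5")   # placeholder filename for the saved Q-network

def read_state():
    """Hypothetical stand-in: return the current state as a 1-D numpy array,
    e.g. from automated sensors or manually entered readings."""
    ...

def take_action(action):
    """Hypothetical stand-in: apply the chosen action in the real environment."""
    ...

while True:
    state = read_state()
    q_values = model.predict(np.array([state]))[0]   # estimated q(s, a) for each action
    action = int(np.argmax(q_values))                # greedy policy
    take_action(action)
```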
The difference in state transition distributions between simulation and reality might cause problems for you. This depends critically on the nature of the problem you are solving, and not much can be said about that in general:
- You should expect the Q values predicted by your network not to match the actual returns seen in production.
- However, it is possible for an agent to act optimally even with mismatched distributions, depending on the task. For that, you need the policy to be optimal in each state, which holds if the relative ordering of $q(s,a)$ is correct, regardless of absolute accuracy.
To address this in the longer term though, you should keep records of state transitions that occur in production and look at modelling them more accurately in a future iteration of your simulated environment once you have enough data to make a more robust model.
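One lightweight way to start is to append every observed transition to a log as the agent runs. A minimal sketch, where the file name and field layout are arbitrary choices:

```python
import csv
import time

def log_transition(state, action, reward, next_state, path="transitions.csv"):
    """Append one observed production transition to a CSV log, for later use
    when building a more accurate simulated environment."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([time.time(), list(state), action, reward, list(next_state)])
```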
Let us suppose that I have a dataset with states associated with the desirable action for that state. And let us suppose that my dataset is big and represents the state space well. With this dataset we could use supervised learning to learn a mapping between states and actions, right?
Yes. If such a dataset is available, supervised learning will learn an approximation of the policy function from it.
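This approach is sometimes called behavioural cloning. A minimal sketch, assuming states are fixed-length numeric vectors and actions are discrete integer labels (the network sizes and training settings here are arbitrary choices):

```python
import numpy as np
from tensorflow.keras import layers, models

def fit_policy(states, actions, n_actions):
    """Fit a simple classifier mapping each state to its recorded desirable action.
    states: (n_samples, state_dim) array; actions: (n_samples,) integer labels."""
    policy = models.Sequential([
        layers.Dense(64, activation="relu", input_shape=(states.shape[1],)),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_actions, activation="softmax"),
    ])
    policy.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    policy.fit(states, actions, epochs=10, batch_size=32)
    return policy

# To act with the learned policy, pick the most probable action for a state:
# action = int(np.argmax(policy.predict(np.array([state]))[0]))
```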
Reinforcement learning (RL) is for when you do not have such a complete and finished dataset, with the answers of how the agent should act in every circumstance. Instead you typically have the definition of an environment, such as the rules of a game, or the controls and sensor inputs from a robot, and the problem is to figure out immediate behaviours that lead to a desired goal. The best action to take in any given situation that leads to a longer-term goal is often not obvious.
RL provides a mechanism to learn from trial and error.
Can I convert a typical reinforcement learning problem to a supervised learning problem?
No, unless you already have the dataset that you suggest.
However, knowledge of supervised learning is applied within RL frameworks. Most "Deep RL", which combines RL with neural networks, can be thought of as an outer RL algorithm that generates training data (the outcomes of behaviour chosen to explore whilst improving performance towards an optimal policy), combined with an inner supervised learning mechanism (generalising from that observation data to help improve performance in yet-unseen situations).
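For instance, in DQN-style methods the "inner" supervised step regresses the Q-network towards bootstrapped targets computed from sampled transitions. A rough sketch, assuming model is a compiled Keras Q-network and the arrays are NumPy batches produced by the outer RL loop (refinements such as a separate target network are omitted):

```python
import numpy as np

def supervised_update(model, states, actions, rewards, next_states, dones, gamma=0.99):
    """One inner supervised step: fit the Q-network to bootstrapped TD targets."""
    q_next = model.predict(next_states)        # estimated q(s', a') for each next state
    targets = model.predict(states)            # start from the current estimates
    targets[np.arange(len(actions)), actions] = (
        rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
    )
    model.fit(states, targets, verbose=0)      # ordinary supervised regression step
```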
In some simpler problems you could use RL techniques or searches to generate a whole dataset for supervised learning, like separate stages of a pipeline. For example, you could perform a tree search from every state in tic tac toe to determine the optimal actions, save the results to a dataset, and learn a policy function from it (a sketch of such a pipeline follows below). It may help you understand the role of RL if you think of an approach like that as one extreme of a continuum: at one end, the RL and supervised learning parts are entirely separate stages; at the other end, RL learns online directly from every observation with little or no supervised learning machinery required. Deep RL fits somewhere in the middle.
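A rough sketch of that two-stage pipeline, where all_states() and optimal_action() are hypothetical helpers standing in for the enumeration and tree-search stage:

```python
import numpy as np

# Stage 1 (hypothetical solver): label every state with its optimal action,
# e.g. by exhaustive minimax search over tic tac toe positions.
states, actions = [], []
for s in all_states():                    # hypothetical: enumerate reachable states
    states.append(s)
    actions.append(optimal_action(s))     # hypothetical: result of a tree search from s

# Stage 2: ordinary supervised learning of a policy from the generated dataset,
# e.g. with a classifier like the fit_policy() sketch above.
policy = fit_policy(np.array(states), np.array(actions), n_actions=9)
```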
I'm not sure if I've understood correctly the whole point of reinforcement learning.
If you have a complete and accurate dataset that describes the optimal solution to a control problem, then using RL may be inefficient. However, in practice, that's a big if. To take a modern example of successful application of RL, where would you get this dataset for the game of Go?
In very many real-world problems with action choices, we do not have access to instructions on how to choose optimally. This is where RL fits into the broader machine learning toolkit: it provides a general mechanism for finding optimal solutions to control problems via trial and error.
There may be alternatives to RL in those cases -
- operations research (OR) topics may overlap (when they do, OR will often be the better choice)
- genetic algorithms
- "classic" optimal control based on solving differential equations of system state
- planning with simulation and search trees (there are many similarities between planning algorithms and RL, they might be considered variations on the same theme)
However, RL has been demonstrated as a strong contender in many areas where it delivers state-of-the-art results, beating other approaches. A typical example would be learning to play Atari computer games.
I've seen a number of different tactics used in different projects.
If it's possible, (2) seems like the preferred option: if you have the ability to control how the environment relays its data, then you can just solve the problem directly.
If that doesn't work, then using layer normalization seems sufficient, since it will dynamically update to reflect the mean and variance of the incoming data.
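As a sketch, that might mean placing a normalization layer at the front of the Q-network. The layer choice and sizes below are assumptions rather than a prescription, and obs_dim and n_actions are placeholders for your observation size and action count:

```python
from tensorflow.keras import layers, models

obs_dim, n_actions = 8, 4   # placeholders: observation vector size and number of actions

# Q-network that normalizes raw observations before the dense layers,
# so the rest of the network sees inputs on a consistent scale.
model = models.Sequential([
    layers.LayerNormalization(input_shape=(obs_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_actions),    # one Q-value estimate per action
])
```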
Ignoring scaling seems risky to me, since it's well-known that scaling can dramatically improve the learning process in supervised settings.