Solved – How to move reinforcement learning model into production

deep learning, neural networks, reinforcement learning

I have trained a reinforcement learning agent on a custom environment using the DQN technique. The custom environment is a simulation of a real production environment.

Now I have the trained NN model in the .h5 file format and want to move this model into production. I do not need it to learn and improve in production, just to act according to the policy it learned in the simulated environment. But I don't have a clear idea of how to do this.

  • Are my custom environment and agent classes still required when making predictions about the action using the saved model?

  • If not, what is the best way to move RL models into production?

Best Answer

There are no general rules covering deploying all possible RL agents to production, as there is a huge variety of RL code and approaches.

However, in your case, you have identified key issues that help make a decision:

  • You don't require the agent to continue learning in production. You consider it trained and ready to use to make decisions.

  • The training environment was a simulation, and you have a real environment to deploy to.

In addition, I can identify a further issue that may have an impact, based on your previous questions:

  • Your custom environment replicates the challenge of the original problem, with a guess at the behaviour of the stresses on the system that the agent is supposed to keep within limits. Therefore, the distribution of state transitions used to train the agent may not match the actual distribution of state transitions in production.

Are my custom environment and agent classes still required when making predictions about the action using the saved model?

If you have finished training, then none of the training code is required. That is not quite the same as saying that you don't want the agent class, because it will depend on how that was written. It may be more convenient to use it in production too, but it is not necessary. A cut-down version of the agent could work, or an entirely new piece of code. As you will see below, if you have a network that can estimate $q(s,a)$, there is very little extra code you need to make it select actions.

The custom environment will not be required.

If not, what is the best way to move RL models into production?

There is no "best way", as things will depend critically on how important success and failure scenarios are to you.

Your trained model estimates the action-value function $\hat{q}(s,a, \theta)$ with the stored weights in the .h5 file being $\theta$. You will need to load a copy of the model into a NN in production. Once this is done, you will have the ability to predict action values, and you can use that to drive a simple greedy deterministic policy.

The code for the policy should implement the greedy action selector:

$$\pi(s) = \text{argmax}_a \hat{q}(s,a,\theta)$$

and this might be no more complex than the following Python, assuming you have NumPy and a Keras model loaded:

action = np.argmax(model.predict(np.array([state]))[0])
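For a fuller picture, a minimal sketch of loading the saved model and wrapping it in a greedy policy function might look like the following. The file name "dqn_model.h5" and the helper name greedy_action are illustrative, not taken from your project:

import numpy as np
from tensorflow.keras.models import load_model

# Load the trained Q-network from the saved .h5 file
model = load_model("dqn_model.h5")

def greedy_action(state):
    # Predict q(s, a) for every action in one forward pass, then pick the best.
    # np.array([state]) adds the batch dimension that Keras expects.
    q_values = model.predict(np.array([state]), verbose=0)[0]
    return int(np.argmax(q_values))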

You will also need to do the following:

  • Have a way to read state values and input them to the code running the agent. This might be automated sensors, or could just be someone taking a reading off a dial or scale and typing the results into a prompt.

  • Have a way to actually take the action in the environment. Again, this might be automated, or have humans acting as proxies. It doesn't matter as long as the action is taken according to the policy (a minimal human-in-the-loop sketch follows this list).
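As an illustration of the human-in-the-loop case, a minimal interactive loop might look like the sketch below. It assumes the greedy_action helper from the earlier sketch and a fixed, agreed ordering of the state features typed at the prompt:

while True:
    raw = input("Enter state values, comma-separated (or 'q' to quit): ")
    if raw.strip().lower() == "q":
        break
    state = [float(x) for x in raw.split(",")]
    print("Recommended action:", greedy_action(state))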

The difference in state transition distributions between simulation and reality might cause problems for you. This depends critically on the nature of the problem you are solving, and not much can be said about that in general:

  • You should expect the Q values predicted by your network not to match the actual returns seen in production.

  • However, it is possible for an agent to act optimally even with mismatched distributions, depending on the task. For that, you need the policy to be optimal in each state, which holds as long as the relative ordering of $q(s,a)$ over actions is correct, regardless of absolute accuracy.

To address this in the longer term, though, you should keep records of the state transitions that occur in production, and look at modelling them more accurately in a future iteration of your simulated environment once you have enough data to build a more robust model.
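One simple way to do that, sketched here with an illustrative file name and field layout, is to append every observed transition to a CSV file as it happens:

import csv

def log_transition(state, action, reward, next_state, path="transitions.csv"):
    # Append one (s, a, r, s') record per line; the file can be analysed
    # offline later when re-fitting the simulated environment.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(list(state) + [action, reward] + list(next_state))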
