In his notes, when he says you must "estimate them from data", he does not mean the reward function. You rarely estimate the reward function. You typically learn the value function, which estimates the immediate reward plus the temporally-discounted future reward (if the temporal discount is zero, then you are estimating just the immediate rewards). Or, you can learn Q values, which are values associated with state-action pairs.
In summary, the reward function and the true transition function are defined by the environment. The agent learns estimates of things like the transition function, Q values, and the value function.
My question is, how should I define the value of the terminal state?
The state value of the terminal state in an episodic problem should always be zero. The value of a state is the expected sum of all future rewards when starting in that state and following a specific policy. For the terminal state, this is zero - there are no more rewards to be had.
So if I want to improve my policy by making it greedy with respect to the neighbor states, the states next to the terminal states won't want to choose the terminal state (since there are positive non-terminal states neighboring it).
You have not made it 100% clear here, but I am concerned that you might be thinking the greedy policy is chosen like this: $\pi(s) = \text{argmax}_a [\sum_{s'} p(s'|s, a) v(s') ]$, where $v(s)$ is your state value function, and $p(s'|s, a)$ is the probability of transitioning to state $s'$ given starting state $s$ and action $a$ (using the same notation as Sutton & Barto, 2nd edition). That is not the correct formula for the greedy action choice. Instead, in order to maximise return from the next action, you take into account the immediate reward plus the expected discounted future rewards from the next state (I have added the commonly-used discount factor $\gamma$ here):
$$\pi(s) = \text{argmax}_a [\sum_{s',r} p(s',r|s, a)(r + \gamma v(s')) ]$$
If you are more used to seeing transition matrix $P_{ss'}^a$ and expected reward matrix $R_{ss'}^a$, then the same formula using those is:
$$\pi(s) = \text{argmax}_a [\sum_{s'} P_{ss'}^a( R_{ss'}^a + \gamma v(s')) ]$$
When you use this greedy action choice, the action that transitions to the terminal state has at least equal value to the other choices.
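As a rough tabular sketch of that formula (hypothetical data structures: `p[s][a]` is a list of `(probability, next_state, reward)` tuples and `v` is a dict of state values), the greedy choice could be computed like this:

```python
def greedy_policy(states, actions, p, v, gamma):
    """pi(s) = argmax_a sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))."""
    pi = {}
    for s in states:
        def q(a):
            # Expected immediate reward plus discounted value of the next state.
            return sum(prob * (r + gamma * v[s_next]) for prob, s_next, r in p[s][a])
        pi[s] = max(actions, key=q)
    return pi
```

Because the immediate reward $r$ is included, a transition into the terminal state still scores its reward of 1 even though $v(\text{terminal}) = 0$.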
In addition, your specific problem has another issue, related to how you have set the rewards.
I am working in an environment where each transition rewards 0 except for the transitions into the terminal state, which reward 1.
Does this sort of environment just not work with state-value dynamic programming methods of reinforcement learning? I don't see how I can make this work.
Recall that state values are defined only with respect to a specific policy. Answers to your problem are going to depend on the type of learning algorithm you use, and whether you allow stochastic or deterministic policies. For any state to have a value other than 0, there must be at least some small chance under the policy of eventually transitioning to the terminal state; most learning algorithms should guarantee this. However, many of these algorithms could well learn convoluted policies that choose not to transition to the terminal state when you would expect or want them to (without knowing your problem definition, I could not say what the intuitive choice would be).
Your biggest issue is that with your reward structure, you have given the agent no incentive to end the episode. Yes, it can get a reward of 1, but your reward scheme means that the agent is guaranteed to get that reward eventually whatever it does; there is no time constraint. If you applied a learning algorithm (e.g. Policy Iteration) to your MDP, you could find that all states except the terminal state have a value of 1, which the agent will get eventually once it transitions to the terminal state. As long as it learns a policy where that happens eventually, then as far as the agent is concerned, it has learned an optimal policy.
If you want an agent that solves your MDP in minimal time in an episodic problem, it is usual to encode some negative reward for each time step. A basic maze solver, for instance, typically gives a reward of -1 for each time step.
An alternative might be to apply a discount factor $0 \lt \gamma \lt 1$ - which will cause the agent to have some preference for immediate rewards, and this should impact the policy so that the step to the terminal state is always taken.
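To see the effect of either fix, here is a toy value-iteration sketch on a simple deterministic chain of states (purely illustrative), comparing undiscounted, discounted, and time-penalised rewards:

```python
def chain_values(n_states, gamma, step_reward=0.0):
    """Value iteration on a chain s0 -> s1 -> ... -> terminal.

    Every transition gives `step_reward`, except the final transition into
    the terminal state, which gives 1. The terminal state has value 0.
    """
    v = [0.0] * (n_states + 1)          # last entry is the terminal state
    for _ in range(1000):
        for s in range(n_states):
            r = 1.0 if s == n_states - 1 else step_reward
            v[s] = r + gamma * v[s + 1]
    return [round(x, 3) for x in v[:n_states]]

print(chain_values(4, gamma=1.0))                    # [1.0, 1.0, 1.0, 1.0]: no urgency
print(chain_values(4, gamma=0.9))                    # [0.729, 0.81, 0.9, 1.0]: sooner is better
print(chain_values(4, gamma=1.0, step_reward=-1.0))  # [-2.0, -1.0, 0.0, 1.0]: time penalty
```

In the last two cases, state values increase as the agent gets closer to the terminal state, so greedy policy improvement pushes the agent towards ending the episode.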
Best Answer
There are no general rules covering deploying all possible RL agents to production, as there is a huge variety of RL code and approaches.
However, in your case, you have identified key issues that help make a decision:
You don't require the agent to continue learning in production. You consider it trained and ready to use to make decisions.
The training environment was a simulation, and you have a real environment to deploy to.
In addition, I can identify a further issue that may have an impact, based on your previous questions:
The simulated environment you trained in is unlikely to match the state transition distributions of the real environment exactly.
If you have finished training, then none of the training code is required. That is not quite the same as saying that you don't want the agent class, because it will depend on how that was written. It may be more convenient to use it in production too, but it is not necessary. A cut-down version of the agent could work, or an entirely new piece of code. As you will see below, if you have a network that can estimate $q(s,a)$, there is very little extra code you need to make it select actions.
The custom environment will not be required.
There is no "best way", as things will depend critically on how important success and failure scenarios are to you.
Your trained model estimates the action-value function $\hat{q}(s,a, \theta)$ with the stored weights in the `.h5` file being $\theta$. You will need to load a copy of the model into a NN in production. Once this is done, you will have the ability to predict action values, and you can use that to drive a simple greedy deterministic policy. The code for the policy should implement the greedy action selector:
$$\pi(s) = \text{argmax}_a \hat{q}(s,a,\theta)$$
and this might be no more complex than the following Python, assuming you have Numpy and a Keras model loaded:
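```python
import numpy as np
from tensorflow import keras

# Sketch only: "q_network.h5" is a placeholder for your saved weights file,
# and the network is assumed to take a state vector and output one Q value
# per discrete action.
model = keras.models.load_model("q_network.h5")

def greedy_action(state):
    """Return argmax_a q_hat(s, a, theta) for a single state."""
    q_values = model.predict(np.array([state]), verbose=0)[0]
    return int(np.argmax(q_values))
```

This is only a sketch: adjust the loading and prediction calls to match how your model was actually saved and structured.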
You will also need to do the following:
Have a way to read state values and input them to the code running the agent. This might be automated sensors, or could just be someone taking a reading off a dial or scale and typing the results into a prompt.
Have a way to actually take the action in the environment. Again, this might be automated, or have humans acting as proxies. It doesn't matter as long as the action is taken according to the policy. A minimal decision loop combining these steps with the greedy selector above is sketched after this list.
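For example (`read_state`, `take_action` and `is_done` are hypothetical hooks that could wrap automated sensors and actuators, or prompts for a human operator):

```python
def run_episode(read_state, take_action, is_done):
    """Drive the real environment with the greedy policy until the episode ends."""
    state = read_state()                # e.g. sensor readings or manually-entered values
    while not is_done(state):
        action = greedy_action(state)   # greedy selector from the earlier snippet
        take_action(action)             # automated actuator or instruction to an operator
        state = read_state()
```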
The difference in state transition distributions between simulation and reality might cause problems for you. This depends critically on the nature of the problem you are solving, and not much can be said about that in general:
You should expect the Q values predicted by your network not to match the actual returns seen in production.
However, it is possible for an agent to act optimally even with mismatched distributions, depending on the task. For that, you need the policy to be optimal in each state, which holds if the relative ordering of $q(s,a)$ across actions is correct, regardless of absolute accuracy.
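A trivial numeric illustration of that point (made-up numbers):

```python
import numpy as np

true_q      = np.array([2.0, 5.0, 1.0])   # hypothetical true action values in one state
predicted_q = np.array([0.4, 0.9, 0.1])   # badly miscalibrated estimates from the network

# The absolute values are way off, but the ordering agrees,
# so the greedy action (index 1) is unchanged.
assert np.argmax(true_q) == np.argmax(predicted_q)
```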
To address this in the longer term though, you should keep records of state transitions that occur in production and look at modelling them more accurately in a future iteration of your simulated environment once you have enough data to make a more robust model.