I like your approach of starting by defining the items of an MDP (States, Actions, Transition probabilities, and Rewards):
$$\text{MDP} = \langle S,A,T,R,\gamma \rangle$$
However, I am not sure your example is entirely clear:
For the transition probabilities $T$ you indicate:
I think these are too high: I don't expect the DVR to stop recording unless there is a bug, so I would put something like 0.99 for each of these transitions.
I do not expect a high error rate either, so the transitions will be practically deterministic (i.e. a 'probability' of 1.0 of going to the 'next' state).
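As a sketch of what such near-deterministic transitions could look like, here is a minimal single-recorder model; the state and action names and the 0.99/0.01 split are my own illustrative assumptions, not values from your model:

```python
# Near-deterministic transition model for a single recorder, written as a
# nested dict: (state, action) -> {next_state: probability}.
# State/action names and the 0.99/0.01 split are illustrative assumptions.
T = {
    ("recording", "keep_recording"): {"recording": 0.99, "failed": 0.01},
    ("idle", "start_recording"):     {"recording": 0.99, "failed": 0.01},
    ("failed", "reset"):             {"idle": 1.0},
}

# Sanity check: every outgoing distribution must sum to 1.
for dist in T.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```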
Your reward seems correct: the number of minutes recorded, since you want to maximize the total recording time.
The states you suggest do not contain enough information to 'solve' the problem. For example, the reward should be defined based on the states and potentially the action, $R(s,a,s')$, but with the states you defined it cannot be calculated.
In order to plan based on the scheduled start times of the programs on your recorders, I think you need the whole schedule in the state.
Scheduling with MDPs
Finite-horizon MDPs are commonly used for scheduling; in most works, time is part of the state.
M. Dirks wrote an essay (Survey of Dynamic Medical Resource Allocation Methods) about different patient-scheduling methods using MDPs. In all the papers, the curse of dimensionality (the exponential growth of the state space with the number of state variables) has to be overcome by simplifying the states. Most methods use counts of patients by type: waiting, in treatment hall 1, in treatment hall 2, emergency case, etc.
The actions are defined as the number of patients to choose for the next time slot. See M. Dirks’ presentation for more details.
L. Zhu et al. (2013) do patient scheduling using the number of reserved slots per day as states, and the action consists of the number of patients (per type) accepted for the next day.
Schneider, J.G. et al. (1998) give a solution to factory scheduling using MDPs. The states contain the time, product inventory, and factory configuration. The actions are factory configurations.
Recording Scheduler MDP
Assuming there are three time slices in a day (to simplify; later you can extend it to 48), I came up with the following model:
States
- $S_\text{time} = \{t_1, t_2, t_3\}$, the time steps in a day
- $S_\text{rec} = \{\text{Rec}_1, \text{Rec}_2\}$, the recorder.
- $S_\text{record_time_left} = \{0,1,2,3\}$ (time slices).
- $S_\text{schema}$, all possible schemas; each schema is a subset of $\{(t_1: 1), (t_1: 2), (t_1: 3), (t_2: 1), (t_2: 2), (t_3: 1) \}$, where each tuple $(t_i: n)$ is a requested recording starting at $t_i$ with a duration of $n$.
The full state space would then be:
$$S = S_\text{time} \times S_\text{rec} \times S_\text{record_time_left} \times S_\text{schema}$$
This gives me, for this problem, $|S|=3 \cdot 2 \cdot 4 \cdot 64 = 1536$ states.
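The state count above can be verified by enumerating the factored state space directly; the concrete encodings (strings for time steps and recorders, tuples for requests) are my own assumptions for illustration:

```python
from itertools import chain, combinations, product

# Enumerate the factored state space from the model above.
times = ["t1", "t2", "t3"]        # S_time
recorders = ["Rec1", "Rec2"]      # S_rec
time_left = [0, 1, 2, 3]          # S_record_time_left

# All subsets of the six possible requested recordings (t_i, n).
requests = [("t1", 1), ("t1", 2), ("t1", 3), ("t2", 1), ("t2", 2), ("t3", 1)]
schemas = list(chain.from_iterable(
    combinations(requests, k) for k in range(len(requests) + 1)))

states = list(product(times, recorders, time_left, schemas))
print(len(schemas))  # 2^6 = 64
print(len(states))   # 3 * 2 * 4 * 64 = 1536
```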
Note that the channel to record from is also important, but for simplicity I leave it out.
Actions
Actions are $a=( (\text{Rec}_1,t_{r1},n_1), (\text{Rec}_2,t_{r2},n_2) )$, where $\text{Rec}_k$ refers to recorder $k$, recording the program that starts at $t_{rk}$ with a duration of $n_k$.
Reward
$R(s)$: the sum of time recorded by all recorders in the current state (with timestamp $t_\text{time}$). If Rec1 is recording nothing and Rec2 has record_time_left = 2, then the reward is $0+1=1$, since in the current state only 1 time step is recorded.
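This per-step reward can be sketched as follows; the function name and the per-recorder dict encoding are my assumptions:

```python
def reward(record_time_left):
    """Sum over recorders of the time recorded during the *current* step.

    record_time_left maps each recorder to its remaining recording time;
    a recorder contributes 1 if it records during this step, else 0.
    """
    return sum(1 if left > 0 else 0 for left in record_time_left.values())

# Rec1 records nothing, Rec2 still has 2 slices left -> reward 0 + 1 = 1.
print(reward({"Rec1": 0, "Rec2": 2}))  # -> 1
```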
Transition function
The transition function $P(s'|s,a)$ should now be defined with the following specifications:
- The next state should have a higher time step (i.e. always from $t_i$ to $t_{i+1}$).
- The record_time_left should be reduced by 1 time unit for both recorders.
- The schema should also be adapted so that it only contains whole programmed recordings and nothing from the past (i.e. nothing before $t_i$).
- And you will probably have to put some probability on upcoming schedule changes, since the scheduled programming might change in the future.
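The deterministic part of these specifications can be sketched as one step function; the state layout is my assumption, and the schedule-change randomness from the last point is left out:

```python
def step(state):
    """Deterministic core of the transition: advance time, count down the
    recorders, and drop schedule entries that now lie in the past.

    Assumed state layout: (t_index, {recorder: time_left}, schema), where
    schema is a set of (start_index, duration) requests.
    """
    t, time_left, schema = state
    next_t = t + 1                                           # t_i -> t_{i+1}
    next_left = {rec: max(0, left - 1) for rec, left in time_left.items()}
    next_schema = {(start, dur) for start, dur in schema if start >= next_t}
    return (next_t, next_left, next_schema)

nxt = step((1, {"Rec1": 0, "Rec2": 2}, {(1, 2), (2, 1), (3, 1)}))
# Time advances to 2, Rec2 counts down to 1, and the (1, 2) request is dropped.
```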
Policy
The resulting policy should then give you the best action based on the time, the state of the recorders (recording or not, and for how long), and the requested recording schedule.
Other examples
There are quite a few good lectures available online with nice examples; here I name just a few:
MDP Diagram
Finally, there are different ways of drawing an MDP diagram:
But the main thing is that you indicate the states as nodes, and the transitions as edges annotated with their probabilities, actions, and rewards.
Best Answer
Modeling the system as a Markov chain, the maximum likelihood estimate for the $A \rightarrow B$ transition probability is simply the number of times you saw $B$ following $A$, divided by the number of times you saw any state following $A$. The transition matrix is a matrix $P$ where $P_{ij}$ contains the $i \rightarrow j$ transition probability. A simple way to calculate this is to construct the transition count matrix $N$, where $N_{ij}$ contains the number of times you observed an $i \rightarrow j$ transition. Then, calculate $P$ by dividing each row of $N$ by its sum.
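This count-and-normalize procedure is a few lines of NumPy; the state names and the observed trajectory are made-up example data:

```python
import numpy as np

# Maximum-likelihood transition matrix from an observed state sequence:
# count transitions into N, then divide each row of N by its sum.
states = ["A", "B", "C"]
observed = ["A", "B", "A", "B", "C", "A", "A"]  # example trajectory (assumed)

idx = {s: i for i, s in enumerate(states)}
N = np.zeros((len(states), len(states)))
for prev, cur in zip(observed, observed[1:]):
    N[idx[prev], idx[cur]] += 1

# Guard against all-zero rows (states never seen as a predecessor).
row_sums = N.sum(axis=1, keepdims=True)
P = np.divide(N, row_sums, out=np.zeros_like(N), where=row_sums > 0)
print(P)
# Row A: one A->A and two A->B out of three -> [1/3, 2/3, 0], and so on.
```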