Solved – Creating a Markov Decision Process

markov-process

So I have an idea for an MDP for my Machine Learning class, but I want to make sure I take the right approach to setting it up. The key underlying question is: how do I approach a problem so that I can describe it as an MDP in a logical, consistent manner? (I am hoping that simply "filling out" the four things below makes it magically correct, but I suspect not.)

I was looking at this outstanding post:
Real-life examples of Markov Decision Processes

Copying the comments about the absolutely necessary elements (I sketch the door example in code right after this list):

  1. States: these can refer to for example grid maps in robotics, or for
    example door open and door closed.
  2. Actions: a fixed set of actions, such as for example going north, south, east, etc for a robot, or opening and closing a door.
  3. Transition probabilities: the probability of going from one state to another given an action. For example, what is the probability of an
    open door if the action is open? In a perfect world the latter could be
    1.0, but if it is a robot, it could have failed to handle the door knob correctly. Another example, in the case of a moving robot, would
    be the action north, which in most cases would bring it to the grid
    cell north of it, but in some cases could have moved too far and
    reached the cell after that.
  4. Rewards: these are used to guide the planning. In the case of the grid example we might want to go to a certain cell, and the reward
    will be higher if we get closer. In the case of the door example an
    open door might give a high reward.
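To make sure I understand those four elements, here is how I would write the door example as plain Python dictionaries (the state names, action names, and the 0.9 "fumbled the door knob" probability are just my own illustrative assumptions, not from the linked post):

```python
# Toy encoding of the "door" MDP from the quoted list.
states = ["door_open", "door_closed"]
actions = ["open", "close"]

# T[state][action] -> {next_state: probability}
# 0.9 models a robot that sometimes fumbles the door knob.
T = {
    "door_closed": {
        "open":  {"door_open": 0.9, "door_closed": 0.1},
        "close": {"door_closed": 1.0},
    },
    "door_open": {
        "open":  {"door_open": 1.0},
        "close": {"door_closed": 0.9, "door_open": 0.1},
    },
}

# R[state] -> reward; an open door is the goal, so it pays more.
R = {"door_open": 1.0, "door_closed": 0.0}
```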

I'm not sure if this would work as an MDP, but I was thinking about modeling DVR scheduling as a state space, assuming 2 tuners (so I can schedule and record 2 things at once).

My thought was that I could try to maximize the number of minutes recorded: if 2 shows are set to record at the same time but one is longer, pick the longer one; alternatively, if a longer show starts just after another one, pick the longer one and skip recording the first (no partial recordings). There are a lot of considerations.

For states, is it only the states of the actual tuners that I need to identify, or also states of the scheduling algorithm? Should I be drawing a state machine diagram and describing what to do when a new request comes in and is processed for DVR X?

  1. States {Not Recording, Recording DVR1, Recording DVR2, Recording DVR1 and DVR2} –> How do I account for pre-planning states? Assuming the 1-2pm window is already scheduled on both DVRs and a new request comes in, how do I factor that into a state?
  2. Actions {Start Recording DVR1, Start Recording DVR2, Stop Recording DVR1, Stop Recording DVR2} –> I think these make sense, agreed?
  3. Transition Probabilities {1.0} –> I think these are high; I don't expect the DVR to continue recording unless there is a bug, so I suppose I could put something like 0.99 for each of the transitions? Are there as many probabilities in here as there are actions, or is it one probability covering ANY of the actions?
  4. Rewards {length of recording request in minutes} –> The reward is the length of the recording, so if two recording requests came in for DVR1, one 30 minutes and the other 60 minutes, then the reward for picking the "optimal" one is 60 (minutes). (I sketch my current guess at all four components in code right after this list.)
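Written out as plain Python, my current guess (with 0.99 as the made-up success probability from point 3) would look roughly like this:

```python
# My current guess at the four components, written out as plain Python.
states  = ["not_recording", "recording_dvr1", "recording_dvr2", "recording_dvr1_and_dvr2"]
actions = ["start_dvr1", "start_dvr2", "stop_dvr1", "stop_dvr2"]

success_prob = 0.99   # made-up probability that a start/stop action succeeds

def reward(minutes_recorded):
    # Reward = length of the recording actually captured, in minutes.
    return minutes_recorded
```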

I'm not sure how the input is factored into the above; would we consider that a state change (an incoming 30-minute show)? Any ideas how to account for time? Let's assume this is a 24-hour window.

Does that even make sense? Is this the right approach?

I took a stab at making a state machine; should I have more things called out explicitly, or is this sufficient detail for an MDP?
(I assume the blue bubbles are states, and would the lines between them be considered "actions"?)
[state machine diagram]

Best Answer

I like your approach of starting by defining the items of an MDP (States, Actions, Transition probabilities, and Rewards):

$$\text{MDP} = \langle S,A,T,R,\gamma \rangle$$

However, I am not sure your example is entirely clear:

  • For the transition probabilities $T$ you indicate:

    I think these are high; I don't expect the DVR to continue recording unless there is a bug, so I suppose I could put something like 0.99 for each of the transitions?

    I do not think there will be a high error rate either, so the transitions will practically be deterministic (i.e. a probability of 1.0 for going to the 'next' state).

  • Your reward seems to be correct: the number of minutes, since you want to maximize the number of minutes recorded.

  • The states you suggest do not contain enough information to 'solve' the problem. For example, the rewards should be defined based on the states and potentially the action, $R(s,a,s')$, but with the states you defined this cannot be calculated.

In order to plan based on the scheduled time of a program on your recorders, I think you need to include the whole schedule in the state.

Scheduling with MDPs

Finite-horizon MDPs are used for scheduling, and in most works time is part of the state. M. Dirks wrote an essay (Survey of Dynamic Medical Resource Allocation Methods) about different patient scheduling methods using MDPs. In all of the papers the curse of dimensionality, i.e. the exponential growth of the search space with the number of state variables, has to be overcome by simplifying the states. Most methods use counts of patients per type: waiting, in treatment 1, in treatment hall 2, emergency case, etc. The actions are defined as the number of patients to choose for the next time slot. See M. Dirks' presentation for more details. L. Zhu et al. (2013) do patient scheduling using the number of reserved slots per day as states, and the action consists of the number of patients (per type) accepted for the next day.

Schneider, J.G. et al. (1998) give a solution to factory scheduling using MDPs. The states contain the time, the product inventory, and the factory configuration. The actions are factory configurations.

Recording Scheduler MDP

Assuming there are three time slices in a day (to simplify; later you can extend it to 48), I came up with the following model:

States

  1. $S_\text{time} = \{t_1, t_2, t_3\}$, the time steps in a day
  2. $S_\text{rec} = \{\text{Rec1}, \text{Rec2}\}$, the recorder.
  3. $S_\text{record\_time\_left} = \{0,1,2,3\}$ (time slices).
  4. $S_\text{schema}$, all possible schemas, a subset of $\{(t_1: 1), (t_1: 2), (t_1: 3), (t_2: 1), (t_2: 2), (t_3: 1)\}$, where each tuple $(t_i: n)$ is a requested recording starting at $t_i$ with a duration of $n$.

The final state would be:

$$S = S_\text{time} \times S_\text{rec} \times S_\text{record\_time\_left} \times S_\text{schema}$$

This gives me, for this problem, $|S|=3 \cdot 2 \cdot 4 \cdot 64 = 1536$ states.
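As a sanity check on that count, here is a small Python sketch (my own encoding; the slot and recorder names are just labels) that enumerates this state space with itertools:

```python
from itertools import product, chain, combinations

times = ["t1", "t2", "t3"]        # S_time
recorders = ["Rec1", "Rec2"]      # S_rec
time_left = [0, 1, 2, 3]          # S_record_time_left

# The 6 possible recording requests (start slot, duration).
requests = [("t1", 1), ("t1", 2), ("t1", 3), ("t2", 1), ("t2", 2), ("t3", 1)]

# S_schema = all subsets of the request set (the power set, 2^6 = 64 schemas).
schemas = list(
    chain.from_iterable(combinations(requests, k) for k in range(len(requests) + 1))
)

states = list(product(times, recorders, time_left, schemas))
print(len(states))  # 3 * 2 * 4 * 64 = 1536
```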

Note that the channel to record from is also important, but for simplicity I leave it out.

Actions

Actions are $a=((Rec_1,t_{r1},n_1), (Rec_2,t_{r2},n_2))$, where $Rec_k$ refers to recorder $k$ recording the program that starts at $t_{rk}$ with a duration of $n_k$.
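As a tiny illustration (my own encoding, with `None` standing in for "record nothing"), one such action could be written as:

```python
# An action assigns each recorder a (start_slot, duration) request, or None.
# Example: Rec1 records the program starting at t1 for 2 slots, Rec2 stays idle.
action = (("Rec1", ("t1", 2)),
          ("Rec2", None))
```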

Reward

$R(s)$: the sum of the time recorded by all recorders in the current state (with timestamp $t_\text{time}$). If Rec1 is recording nothing and Rec2 has record_time_left = 2, then the reward would be $0+1=1$, since in the current state it will only record 1 time step.
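In code, that reward could be a sketch like the following, assuming the state carries a record_time_left value per recorder as defined above:

```python
def reward(record_time_left_by_recorder):
    # Each recorder that is actually recording in this time step contributes 1.
    return sum(1 if left > 0 else 0 for left in record_time_left_by_recorder.values())

# Example from the text: Rec1 idle, Rec2 with record_time_left = 2 -> 0 + 1 = 1
print(reward({"Rec1": 0, "Rec2": 2}))  # 1
```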

Transition function

The transition function $P(s'|s,a)$ should now be defined with the following specifications (a sketch in code follows this list):

  • The next state should have a time step one higher (i.e. always from $t_i$ to $t_{i+1}$).
  • The record_time_left should be reduced for both recorders (by 1 time unit).
  • The schema should also be adapted such that it only contains whole programmed recordings and nothing in the past (i.e. nothing before $t_i$).
  • And you will probably have to assign some probability to upcoming schedule changes, since the scheduled programming might change in the future.
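Ignoring the last bullet (stochastic schedule changes), a deterministic sketch of those rules could look like this; using integer time slots 0, 1, 2 for $t_1, t_2, t_3$ and one dict per state component is my own encoding, not part of the model above:

```python
def step(state, action):
    """Deterministic part of P(s'|s,a): advance time, start newly chosen
    recordings, count down remaining recording time, and drop requests
    whose start slot is now in the past."""
    t, time_left, schema = state            # time slot, {recorder: slots left}, set of requests
    next_left = {}
    for rec, left in time_left.items():
        chosen = action.get(rec)            # (start_slot, duration) or None
        if left == 0 and chosen is not None and chosen[0] == t:
            left = chosen[1]                # start the chosen recording now
        next_left[rec] = max(left - 1, 0)   # one time slot elapses
    next_schema = {(start, dur) for (start, dur) in schema if start > t}
    return (t + 1, next_left, next_schema)

# Example: at t=0, Rec1 starts the 2-slot request, Rec2 stays idle.
s = (0, {"Rec1": 0, "Rec2": 0}, {(0, 2), (1, 1)})
a = {"Rec1": (0, 2), "Rec2": None}
print(step(s, a))   # (1, {'Rec1': 1, 'Rec2': 0}, {(1, 1)})
```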

Policy

The resulting policy should then give you the best action based on the time, the state of the recorders (recording or not, and for how long), and the requested recording schedule.
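For completeness, a textbook value-iteration loop over such an enumerated model could look like the sketch below (infinite-horizon, discounted form; for this three-slot problem a finite-horizon backward induction would also work). P[(s, a)] and R[s] are assumed to be plain dicts built from the definitions above:

```python
def value_iteration(states, actions, P, R, gamma=0.95, tol=1e-6):
    """Generic value iteration. P[(s, a)] is a dict {next_state: probability},
    R[s] is the reward for being in state s."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (R[s] + gamma * V[s2]) for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged value function.
    policy = {
        s: max(
            actions,
            key=lambda a: sum(p * (R[s] + gamma * V[s2]) for s2, p in P[(s, a)].items()),
        )
        for s in states
    }
    return policy, V
```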

Other examples

There are quite a few good lectures available online with nice examples.

MDP Diagram

Finally, there are different ways of drawing an MDP diagram.

But the main thing is that you indicate the states as nodes, and the transitions as edges with their probabilities, actions, and rewards.
