Hi Braydon,
I am not sure why you are only looking at the first two episodes. RL can take thousands of episodes to converge, so the first few don't give you enough information. In fact, I ran your models for 20 episodes and the action sequence changed after just a few episodes. If nothing else, I would check the reward formulation, since the reward drives how the neural network's weights change and thus how actions are selected (in addition to exploration; see the sketches after the log below). For reference, here are episodes 17 and 18 from that run:
Episode: 17/ 20 | Episode Reward : -5.00 | Episode Steps: 5 | Avg Reward : -5.00 | Step Count : 85 | Episode Q0 : -120.83
1.0000e-04
prev_state = 11.90 11.90 12.00 11.20
action = 0.00 0.00 0.00 0.00
new_state = 11.90 11.90 12.00 11.20
prev_state = 11.90 11.90 12.00 11.20
action = 0.10 0.10 -0.10 0.00
new_state = 12.00 12.00 11.90 11.20
prev_state = 12.00 12.00 11.90 11.20
action = -0.10 0.00 -0.10 0.10
new_state = 11.90 12.00 11.80 11.30
prev_state = 11.90 12.00 11.80 11.30
action = -0.10 0.10 0.00 -0.10
new_state = 11.80 12.00 11.80 11.20
prev_state = 11.80 12.00 11.80 11.20
action = 0.10 0.00 -0.10 0.00
new_state = 11.90 12.00 11.70 11.20
Episode: 18/ 20 | Episode Reward : -5.00 | Episode Steps: 5 | Avg Reward : -5.00 | Step Count : 90 | Episode Q0 : -83.15
1.0000e-04
prev_state = 11.70 11.90 11.50 11.60
action = 0.00 0.00 0.00 0.00
new_state = 11.70 11.90 11.50 11.60
prev_state = 11.70 11.90 11.50 11.60
action = 0.10 0.10 -0.10 0.00
new_state = 11.80 12.00 11.40 11.60
prev_state = 11.80 12.00 11.40 11.60
action = -0.10 0.00 -0.10 0.10
new_state = 11.70 12.00 11.30 11.70
prev_state = 11.70 12.00 11.30 11.70
action = -0.10 0.10 0.00 -0.10
new_state = 11.60 12.00 11.30 11.60
prev_state = 11.60 12.00 11.30 11.60
action = 0.10 0.00 -0.10 0.00
new_state = 11.70 12.00 11.20 11.60
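On the reward side: both episodes above log a reward of -5.00 over 5 steps, i.e. -1 per step, which suggests a flat per-step penalty. A flat penalty tells the agent nothing about whether an action moved the levels in the right direction. Here is a minimal sketch of a shaped alternative, assuming a hypothetical setpoint target for the four levels (the setpoint of 12 in the usage comment is my assumption, not from your code):

function r = computeReward(state, target)
% Minimal sketch of a shaped reward. The distance term gives the agent
% a gradient toward the goal; the constant -1 still penalizes every
% extra step so shorter episodes remain preferable.
% e.g., computeReward([11.90 11.90 12.00 11.20], 12) returns -2
r = -sum(abs(state - target)) - 1;
end

With a distance term like this, Q-values for actions that move the levels toward the target grow relative to the others, so different states start producing different greedy actions.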
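On the exploration side, if you are doing something epsilon-greedy, the schedule matters as much as the reward: with epsilon near 1 the action sequence is essentially random, and with epsilon near 0 it looks deterministic even early in training. A minimal sketch; every value below is an assumption, not your actual hyperparameter:

% Hypothetical epsilon-greedy selection over a discrete action set.
epsilon    = 0.9;                 % assumed current exploration rate
epsilonMin = 0.01;                % assumed floor so some exploration remains
decay      = 0.995;               % assumed per-step decay factor
numActions = 81;                  % hypothetical: {-0.1, 0, 0.1} per tank -> 3^4
qValues    = randn(numActions,1); % stand-in for the critic's Q estimates
if rand < epsilon
    actionIdx = randi(numActions);   % explore: uniform random action
else
    [~, actionIdx] = max(qValues);   % exploit: greedy w.r.t. current Q
end
epsilon = max(epsilonMin, epsilon * decay);  % anneal exploration over time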