Solved – LSTM network in the Asynchronous Advantage Actor-Critic (A3C) algorithm

Tags: lstm, reinforcement learning

I'm a little confused about the usage of the LSTM network in the Asynchronous Advantage Actor-Critic (A3C) algorithm. The input to an LSTM network is a sequence plus the network state, so my question is: when we start a learning update while the game (episode) has not yet finished, should I reset the LSTM to its zero state again, or reuse the last state it had before the update began?

Best Answer

With an LSTM, at each time step you feed in the current observation together with the LSTM state (the zero state at the first step of an episode). The LSTM produces the action (via the policy head) and an updated LSTM state, which you then feed back in at the next step.
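To make the state-carrying explicit, here is a minimal sketch of a recurrent actor-critic module, assuming a PyTorch-style implementation; the module name, layer sizes, and head names (`policy_head`, `value_head`) are illustrative, not taken from any particular A3C codebase.

```python
import torch
import torch.nn as nn

class A3CLSTMPolicy(nn.Module):
    """Illustrative actor-critic with an LSTM cell that carries state between steps."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def initial_state(self, batch_size=1):
        # Zero state, used at the first step of every episode.
        h = torch.zeros(batch_size, self.lstm.hidden_size)
        c = torch.zeros(batch_size, self.lstm.hidden_size)
        return (h, c)

    def forward(self, obs, state):
        x = torch.relu(self.encoder(obs))
        h, c = self.lstm(x, state)        # consume the previous LSTM state
        logits = self.policy_head(h)      # action distribution parameters
        value = self.value_head(h)        # state-value estimate
        return logits, value, (h, c)      # return the new state for the next step
```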

The easiest option is to train only after your episode ends. In that case you simply reset the LSTM state to zero at the start of the next episode and perform the update over the whole episode. Modern libraries handle the recurrent state for you, so you only need to supply the sequence of inputs and targets.
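Below is a rough sketch of that "update after the episode ends" option for a single worker, using the module defined above. It assumes the classic Gym API (`reset()` returning an observation, `step()` returning a 4-tuple) and a plain policy-gradient plus value loss; it deliberately omits the asynchronous, shared-parameter machinery of full A3C.

```python
def run_episode(env, model, optimizer, gamma=0.99):
    obs, done = env.reset(), False
    state = model.initial_state()        # zero LSTM state at episode start
    log_probs, values, rewards = [], [], []

    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        logits, value, state = model(obs_t, state)   # carry the LSTM state forward
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        obs, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        values.append(value)
        rewards.append(reward)

    # Episode finished: compute discounted returns and do one update.
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    values = torch.cat(values).squeeze(-1)
    advantages = returns - values.detach()

    policy_loss = -(torch.stack(log_probs).squeeze(-1) * advantages).sum()
    value_loss = (returns - values).pow(2).sum()
    loss = policy_loss + 0.5 * value_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the update happens only after the episode is complete, the next call to `run_episode` naturally starts from a fresh zero state, which matches the answer above.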
