Solved – How are deep learning and reinforcement learning combined and used together

conv-neural-network, deep learning, machine learning, neural networks, reinforcement learning

Recent advances such as Google's AlphaGo show a really powerful use of deep learning and reinforcement learning in a complicated space (Go). How are deep learning and reinforcement learning used together, for example, in Atari or Go?

As far as I know, when we say we use them together, we mean using deep learning (e.g., a CNN) to predict the Q-value in reinforcement learning, and then using that Q-value to make the decision. Am I right?
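For reference, the setup described in the question (a network predicts Q, then Q drives the decision) can be sketched in a few lines. This is only a minimal illustration, not DeepMind's code: the list `q_values` stands in for the output of a CNN, and `select_action`, `td_target`, `epsilon`, and `gamma` are names chosen for this example.

```python
import random

def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice: explore with probability epsilon, else take argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def td_target(reward, next_q_values, gamma, done):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    The network would be trained (by back-propagation) to move
    Q(s, a) toward this target.
    """
    if done:
        return reward  # no future value after a terminal state
    return reward + gamma * max(next_q_values)

# q_values would come from the network's forward pass on the current state.
q_values = [0.1, 0.9, 0.3]
action = select_action(q_values, epsilon=0.0)  # greedy when epsilon is 0
target = td_target(reward=1.0, next_q_values=[0.5, 2.0], gamma=0.9, done=False)
```

With `epsilon=0.0` the choice is purely greedy; in practice `epsilon` is usually started high and decayed so the agent explores early on.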

Best Answer

Short Answer:

AlphaGo used reinforcement learning to further tune its policy network (a deep neural network), which it then used to simulate many games as training data for its value network (a second deep neural network). Together, these two networks dramatically reduced the space of moves to search: the policy network narrowed the search horizontally (fewer candidate moves to consider) and the value network shortened it vertically (less need to play every line out to the end).

Long Answer:

AlphaGo built two different deep neural networks. The first network predicted which move an expert would make. AlphaGo then used reinforcement learning to further tune this network by making it play many games against itself. Both the supervised learning approach and the reinforcement learning approach used back-propagation to update the weights of the network. The simulated games were then used to train a second deep neural network to predict whether AlphaGo would win the game given the state of the board.
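To make the reinforcement learning step more concrete, here is a minimal sketch of a REINFORCE-style policy-gradient update, which is one common way to tune a move-prediction policy from self-play results. It is a toy under stated assumptions: the "network" is just a list of logits for one position (no deep net, no back-propagation through layers), and `softmax`, `reinforce_update`, `outcome`, and `lr` are names invented for this example.

```python
import math

def softmax(logits):
    """Turn raw scores into move probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(logits, action, outcome, lr):
    """One REINFORCE step for a single move played during self-play.

    outcome is +1 if the game was won and -1 if it was lost.
    For a softmax policy, d log pi(action) / d logit_i = 1[i == action] - pi_i,
    so moves from won games become more likely and moves from lost games less.
    """
    probs = softmax(logits)
    return [
        logit + lr * outcome * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]

# Hypothetical position with three legal moves; move 1 was played and the game was won.
start = [0.0, 0.0, 0.0]
after_win = reinforce_update(start, action=1, outcome=1.0, lr=0.3)
after_loss = reinforce_update(start, action=1, outcome=-1.0, lr=0.3)
```

In a real system the same gradient signal would be pushed back through all the weights of the network rather than applied to the logits directly.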

When playing a live game, AlphaGo would then use the first neural network to find likely moves. Promising moves were evaluated in two ways. First, AlphaGo would use a simple softmax/logistic regression model to quickly simulate the rest of the game (after the promising move) until someone won. They used logistic regression for this simulation instead of the deep net because it could run in microseconds instead of milliseconds (they had to run many simulations). Second, they would use the value network to predict who would win. They would then average the predicted win/loss with the win/loss of the simulated game to arrive at an estimated value for the promising move. These two evaluation techniques were repeated many, many times (via a process called Monte Carlo Tree Search, MCTS) before AlphaGo picked its actual move.
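The averaging step above can be sketched as follows. This is a simplified illustration, not AlphaGo's implementation: it evaluates a flat list of candidate moves rather than growing a search tree, and `evaluate_leaf`, `estimate_move_values`, `value_fn`, `rollout_fn`, and `mix` are hypothetical names standing in for the value network and the fast rollout model.

```python
def evaluate_leaf(value_prediction, rollout_outcome, mix=0.5):
    """Blend the value-network prediction with one rollout result.

    mix=0.5 gives the plain average of the two evaluations described above.
    """
    return (1.0 - mix) * value_prediction + mix * rollout_outcome

def estimate_move_values(moves, value_fn, rollout_fn, n_simulations):
    """Average many blended evaluations for each candidate move.

    value_fn(move)   -> predicted result in [-1, 1] (stand-in for the value network)
    rollout_fn(move) -> +1.0 or -1.0 (stand-in for one fast simulated game)
    """
    estimates = {}
    for move in moves:
        total = 0.0
        for _ in range(n_simulations):
            total += evaluate_leaf(value_fn(move), rollout_fn(move))
        estimates[move] = total / n_simulations
    return estimates

# Toy, deterministic stand-ins so the example is reproducible.
toy_value = {"a": 0.2, "b": -0.4}
estimates = estimate_move_values(
    moves=["a", "b"],
    value_fn=lambda m: toy_value[m],
    rollout_fn=lambda m: 1.0 if m == "a" else -1.0,
    n_simulations=10,
)
best_move = max(estimates, key=estimates.get)
```

A full MCTS additionally keeps visit counts per node and balances exploring new moves against revisiting good ones; the sketch only shows how the two evaluations are combined and averaged over repeated simulations.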
