I'd recommend getting an overview of the math currently used in deep learning architectures for supervised settings (this does mean looking into approaches that involve "training sets") before you dive deeper into other math.
http://www.deeplearningbook.org/ has a very good overview of the math you'd need to understand what's going on in neural nets/deep nets. Once you're comfortable with the current approaches, you'll be able to understand the research in the field and the directions it's heading in (from ICML and NIPS papers, for instance).
At that point, you will likely find open problems that interest you, and you can begin to actively work on them. It's often useful to have a problem you want to solve in mind, and then explore all the work that's been done on it (prior approaches, the math involved, etc.). Sometimes you'll find problems that interest you deeply but whose current solutions are unsatisfying; this is really the point at which you might have to invent (discover?) the math needed to solve them, or "borrow" math from a different field. The main benefit of working on problems similar to what other researchers are interested in is that there's a community publishing work at a breakneck pace, and you'll be able to quickly get feedback on approaches that have been tried and haven't quite worked just yet.
I'm not discounting the value of learning math for its own sake, just saying that if you learn the math in light of a problem (or ten), you'll learn how to apply existing math well (plus how to do a good literature search), and you'll also learn to recognize when new math is required.
There is a good survey paper here.
As a quick summary: in addition to Q-learning methods, there is also a class of policy-based methods, where instead of learning the Q function, you directly learn the best policy $\pi$ to use.
These methods include the popular REINFORCE algorithm, which is a policy gradient algorithm. TRPO and GAE are similar policy gradient algorithms.
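For concreteness, here's a minimal REINFORCE sketch in PyTorch. The `ToyEnv` class and all hyperparameters are illustrative stand-ins, not from any particular paper or library; the point is just that each action's log-probability is weighted by the discounted return that followed it.

```python
import torch
import torch.nn as nn

class ToyEnv:
    """Ten-step toy episode; action 1 earns reward, action 0 does not."""
    def reset(self):
        self.t = 0
        return torch.zeros(4)                        # dummy observation
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        return torch.zeros(4), reward, self.t >= 10  # obs, reward, done

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99
env = ToyEnv()

for episode in range(200):
    obs, done, log_probs, rewards = env.reset(), False, [], []
    while not done:
        dist = torch.distributions.Categorical(logits=policy(obs))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done = env.step(action.item())
        rewards.append(reward)
    # Compute discounted returns G_t backwards over the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # REINFORCE: increase log pi(a_t|s_t) in proportion to the return G_t.
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```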
There are a lot of other variants on policy gradients, and they can be combined with Q-learning in the actor-critic framework. The A3C algorithm (asynchronous advantage actor-critic) is one such actor-critic algorithm, and a very strong baseline in reinforcement learning.
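As a sketch of the actor-critic idea, assuming the same toy discrete-action setup as above: a critic estimates $V(s)$, the one-step TD error stands in for the advantage, and the actor takes a policy-gradient step weighted by that advantage. This is the single-worker core of the idea, not A3C's asynchronous machinery.

```python
import torch
import torch.nn as nn

actor  = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
critic = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def update(s, a, r, s_next, done, gamma=0.99):
    v = critic(s).squeeze()
    v_next = torch.zeros(()) if done else critic(s_next).squeeze().detach()
    advantage = r + gamma * v_next - v            # TD error as advantage estimate
    dist = torch.distributions.Categorical(logits=actor(s))
    actor_loss  = -dist.log_prob(a) * advantage.detach()  # policy gradient step
    critic_loss = advantage.pow(2)                        # fit V to the TD target
    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()
```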
You can also search for the best policy $\pi$ by mimicking the outputs of an optimal control algorithm; this is called guided policy search.
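Full guided policy search is considerably more involved, but the core supervised step can be sketched as plain regression of a neural policy onto (state, action) pairs from the controller. The random data below is a placeholder for real controller trajectories.

```python
import torch
import torch.nn as nn

# Stand-ins for trajectories from an optimal-control solver (e.g. iLQR);
# in real guided policy search these come from the controller, not randn.
controller_states  = torch.randn(1000, 4)
controller_actions = torch.randn(1000, 2)

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(500):
    pred = policy(controller_states)
    loss = (pred - controller_actions).pow(2).mean()  # imitate the controller
    opt.zero_grad()
    loss.backward()
    opt.step()
```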
In addition to Q-learning and policy gradients, which are both applied in model-free settings (neither algorithm maintains a model of the world), there are also model-based methods, which do estimate a model of the world. These models are valuable because they can be vastly more sample efficient.
Model-based algorithms aren't mutually exclusive with policy gradients or Q-learning. A common approach is to perform state estimation / learn a dynamics model, and then train a policy on top of the estimated state, as in the sketch below.
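Here is an illustrative sketch of that pattern (everything here, including `reward_fn`, is an assumption rather than a specific published algorithm): fit a one-step dynamics model to observed transitions, then use it for simple random-shooting planning.

```python
import torch
import torch.nn as nn

# Dynamics model: predicts the next state from (state, action).
dynamics = nn.Sequential(nn.Linear(4 + 2, 64), nn.Tanh(), nn.Linear(64, 4))
opt = torch.optim.Adam(dynamics.parameters(), lr=1e-3)

def fit_dynamics(states, actions, next_states):
    pred = dynamics(torch.cat([states, actions], dim=-1))
    loss = (pred - next_states).pow(2).mean()   # one-step prediction error
    opt.zero_grad()
    loss.backward()
    opt.step()

def plan(state, reward_fn, horizon=10, n_candidates=64):
    # Random-shooting planning: sample candidate action sequences, roll each
    # through the learned model, and return the first action of the best one.
    seqs = torch.randn(n_candidates, horizon, 2)
    total = torch.zeros(n_candidates)
    s = state.unsqueeze(0).expand(n_candidates, -1)
    with torch.no_grad():
        for t in range(horizon):
            total += reward_fn(s, seqs[:, t])
            s = dynamics(torch.cat([s, seqs[:, t]], dim=-1))
    return seqs[total.argmax(), 0]
```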
So, as for a classification, one breakdown would be:
- Q or V function learning
- Policy based methods
- Model based
Policy-based methods can be further subdivided into:
- Policy gradients
- Actor Critic
- Policy search
Short Answer:
AlphaGo used reinforcement learning to further tune its policy deep neural network, which it then used to simulate many games to train its value deep neural network. Together, these two networks dramatically reduce the search space of moves: the policy network prunes it horizontally (fewer candidate moves per position) and the value network vertically (shallower lookahead).
Long Answer:
AlphaGo trained two different deep neural networks. The first network predicted which move an expert would make. AlphaGo then used reinforcement learning to further tune this network by making it play many games against itself. Both the supervised learning stage and the reinforcement learning stage used back-propagation to update the network's weights. The simulated games were then used to train a second deep neural network to predict whether AlphaGo would win the game given the state of the board.
When playing a live game, AlphaGo would then use the first neural network to find likely moves. Promising moves were evaluated in two ways. First, AlphaGo would use a simple softmax/logistic regression model to quickly simulate moves (after the promising move) until someone won; they used logistic regression for this simulation instead of the deep net because it could run in microseconds instead of milliseconds (they had to run many simulations). Second, they would use the value deep neural net to predict who would win. They would then average the predicted win/loss with the win/loss of the simulated game to arrive at an estimated value for the promising action. These two evaluation techniques were repeated many, many times (via a process called Monte Carlo Tree Search, MCTS) before AlphaGo picked its actual move.
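As a rough sketch of that averaging step (the function names are stand-ins, not DeepMind's code), the leaf evaluation mixes the value network's prediction with the fast rollout's outcome; in the paper the mixing weight was 0.5:

```python
def evaluate_leaf(state, value_net, fast_rollout, lam=0.5):
    # `value_net` and `fast_rollout` are stand-ins for AlphaGo's value
    # network and fast rollout policy.
    v = value_net(state)      # value network's estimate of winning from here
    z = fast_rollout(state)   # +1 / -1 outcome of one quick simulated game
    return (1 - lam) * v + lam * z
```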