Solved – Combining reinforcement learning with labelled data

neural networks · reinforcement learning · supervised learning · time series

I'm attempting to train a neural network to maximise a reward function, where the network takes a time series as input. At each time step it makes a decision based on what it thinks will maximise the reward. However, I also have a small portion of the data labelled with a good action, which would be valuable to incorporate.

Does anyone have any suggestions as to how to blend the two approaches?

I was thinking of either pretraining on the labelled data, or calculating two weight updates at each timestep – one that encourages matching the labels and another from gradient ascent on the reward function. I would weight these depending on training time, so that the update based on labelled data has more influence at the start of training and less towards the end (see the sketch below). Any pointers/papers would be appreciated!

(combining supervised and reinforcement learning is the gist of what I'm looking for)
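
To make the blended-update idea concrete, here is a minimal sketch in PyTorch. The `PolicyNet` architecture, the REINFORCE-style policy-gradient term, and the linear decay schedule are all assumptions for illustration, not a prescribed implementation:

```python
# A minimal sketch of the time-weighted blend proposed above, in PyTorch.
# PolicyNet, the reward signal, and the data tensors are hypothetical
# placeholders; the RL term is a simple REINFORCE-style policy gradient.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Maps a time-series window to a distribution over actions."""
    def __init__(self, n_features: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, x):                      # x: (batch, time, features)
        _, h = self.rnn(x)                     # h: (1, batch, hidden)
        return self.head(h.squeeze(0))         # logits: (batch, n_actions)

def combined_update(net, opt, x_rl, rewards, actions_rl,
                    x_sup, labels, step, total_steps):
    """One gradient step blending a supervised loss with an RL loss.

    alpha starts near 1 (labels dominate) and decays linearly to 0,
    matching the schedule proposed above.
    """
    alpha = max(0.0, 1.0 - step / total_steps)

    # Supervised term: match the labelled "good" actions.
    sup_loss = F.cross_entropy(net(x_sup), labels)

    # RL term: REINFORCE -- increase the log-probability of actions
    # in proportion to the reward they received.
    log_probs = F.log_softmax(net(x_rl), dim=-1)
    chosen = log_probs.gather(1, actions_rl.unsqueeze(1)).squeeze(1)
    rl_loss = -(chosen * rewards).mean()       # gradient *ascent* on reward

    loss = alpha * sup_loss + (1.0 - alpha) * rl_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```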

Best Answer

The combination of RL and supervised learning is well described in this paper. There are a lot of pointers in its Related Work section as well.

In general, they combine the Double DQN approach with a supervised loss (they use a margin loss, but you could try cross-entropy, for instance). This lets the agent not only learn the action values during training, but also avoid behaving in an 'unusual' manner on states covered by the labelled data.
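
For illustration, here is a rough sketch of that combined objective in PyTorch: a Double DQN TD loss on replayed transitions plus a large-margin supervised loss on the labelled (demonstration) transitions. The network names, the margin constant, and the weighting `lam` are assumptions, not the paper's actual code:

```python
# Sketch of the combined objective described above: Double-DQN TD loss
# plus a large-margin supervised loss on the labelled transitions.

import torch
import torch.nn.functional as F

MARGIN = 0.8  # hypothetical margin; zero for the expert action itself

def double_dqn_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    """Standard Double DQN TD error: online net selects the next action,
    target net evaluates it."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # select
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluate
        target = r + gamma * (1.0 - done) * q_next
    return F.smooth_l1_loss(q_sa, target)

def margin_loss(q_net, s_demo, a_expert):
    """Large-margin loss: the expert action's Q-value must beat every
    other action's Q-value by at least MARGIN."""
    q = q_net(s_demo)                                   # (batch, n_actions)
    margins = torch.full_like(q, MARGIN)
    margins.scatter_(1, a_expert.unsqueeze(1), 0.0)     # no margin for a_E
    q_expert = q.gather(1, a_expert.unsqueeze(1)).squeeze(1)
    return ((q + margins).max(dim=1).values - q_expert).mean()

def combined_loss(q_net, target_net, batch, demo_batch, lam=1.0):
    td = double_dqn_loss(q_net, target_net, *batch)
    sup = margin_loss(q_net, *demo_batch)
    return td + lam * sup   # lam weights the supervised term
```

The point of the margin term is that it pins down the values of actions the demonstrator never took: pure TD learning leaves those Q-values unconstrained on demonstration states, whereas the margin forces them below the expert action's value.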