Deep Q Networks and Double DQN



Hello guys,

Today we will be talking about a very famous algorithm, Deep Q-Learning. This paper by DeepMind, which can be found - here, really took the scale of problems to which RL can be applied to the next level.

The paper is easy to read, so I will not be going into the details of the workings of the algorithm. We will rather look at the main ideas of the paper, and how they get it to work across a suite of Atari games.

My TensorFlow implementation of DQN, and of the more recent 2016 improvement, Double Deep Q Networks, can be found - here.


Training steps vs. summed minibatch squared error of Double DQN: CartPole environment


The beauty of this paper is that it does not introduce any new network architecture or concepts. It has 3 simple tweaks, which make neural networks work as function approximators in the RL setting.

1) Experience Replay: Arguably the most important of the 3. This allows the agent to store past experiences and reuse them for learning. This prevents the agent from getting stuck in local minima, by not allowing it to forget past experiences. It is also necessary because sampling randomly from the buffer makes sure that the samples in a batch are not highly correlated! A more recent version called Prioritized Experience Replay can be found - here.

2) Target Networks: The primary problem with using neural networks in RL is non-stationary targets. As the policy changes over time, so do the target values. This addition stabilises learning greatly: the weights used to generate the targets are kept constant for a period of time, and then updated to match the current network.

3) Tackling Non-Markovity: They concatenate the last 4 frames and feed them to the network. This gives the agent some added information. For example, in the game of Pong, it is impossible to determine the velocity of the ball from a single frame! Stacking 4 frames gives the agent the crucial information it needs to find the optimal policy. A minimal sketch of all three tweaks is given right after this list.
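To make the three tweaks concrete, here is a rough sketch in plain Python/NumPy. This is not the code from my repository; the buffer capacity, update period, batch size and the 84x84 frame shape (the preprocessed size used for the Atari games) are just placeholder values.

```python
import random
from collections import deque

import numpy as np


# Tweak 1: a replay buffer stores past transitions and samples
# random (roughly uncorrelated) minibatches for learning.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones


# Tweak 2: the target network is a frozen copy of the online network,
# refreshed only every `period` training steps (placeholder value).
def maybe_update_target(online_weights, target_weights, step, period=10_000):
    if step % period == 0:
        target_weights[:] = [w.copy() for w in online_weights]  # hard copy
    return target_weights


# Tweak 3: stack the last 4 preprocessed frames along the channel axis
# so the network can infer velocities (e.g. of the ball in Pong).
def stack_frames(frame_history):
    # frame_history: list of 84x84 grayscale frames -> (84, 84, 4) input
    return np.stack(frame_history[-4:], axis=-1)
```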

A better way would be to incorporate Long Short-Term Memory networks, as they can take many more past frames into account. This paper by UT Austin deals with this. The problem with incorporating LSTMs is that they need sequential data. This breaks the IID assumption, as consecutive samples will be highly correlated!
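Coming back to the Double DQN implementation linked above: the only change from vanilla DQN is how the target is computed. Vanilla DQN lets the target network both select and evaluate the greedy next action, which tends to overestimate Q-values; Double DQN selects the action with the online network and evaluates it with the target network. A minimal sketch, assuming hypothetical q_online and q_target functions that return a batch of Q-values (again, just the idea, not the code from my repository):

```python
import numpy as np


def dqn_target(rewards, next_states, dones, q_target, gamma=0.99):
    # Vanilla DQN: the target network both selects and evaluates
    # the greedy next action -> prone to overestimation.
    next_q = q_target(next_states)                        # shape: (batch, n_actions)
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)


def double_dqn_target(rewards, next_states, dones, q_online, q_target, gamma=0.99):
    # Double DQN: the online network picks the next action (selection),
    # the target network scores it (evaluation).
    best_actions = q_online(next_states).argmax(axis=1)
    next_q = q_target(next_states)
    chosen = next_q[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * chosen
```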

This is an exciting time to be involved in RL research! We are finally able to scale our algorithms to tackle large-scale problems.

Drop me a mail if you need any help with anything in this post.

Cheers :)