Diving into the Atari-game playing algorithm: Deep Q-Networks
December 01, 2019
This was a small collaborative research project about the convergence properties of the Deep Q Network for the MSc Reinforcement Learning course at the University of Amsterdam. Written by Leon Lang, Igor Pejic, Simon Passenheim, and Yoni Schirris.
Introduction
Reinforcement Learning (RL), the idea of having an agent learn by
directly perceiving its environment and acting to manipulate it, is one
of the great branches of Artificial Intelligence. In this blog post you
will read about a specific breakthrough by DeepMind: its success in
creating a single deep RL architecture that was able to achieve gameplay
in Atari games comparable to that of humans across almost all the $49$
games [1]. They called it DQN, which stands for “Deep
Q-Network”. When its predecessor was first introduced in 2013 [2],
it was, according to the authors, the first system that was able to
learn control policies “directly from high-dimensional sensory input
using reinforcement learning”.
In the coming sections, we will first study Q-learning, a specific RL
algorithm, in its tabular form. Then we will learn how Deep Q-learning
makes it in principle possible to apply RL to more complex problems and
see the so-called deadly triad, three properties that are
often present in these problems and can lead to divergence. In the section
that follows, we will briefly introduce the two main tricks that DQN
utilizes for stability, namely experience replay and a fixed target
network. Finally, we answer, through experimentation, how these tricks
can overcome the divergence problems that partly come from the deadly
triad, evaluating the performance in the OpenAI CartPole gym
environment. Concretely, our research questions are:
- Can we find an environment where DQN diverges without the two main tricks employed in that paper, but converges when including the tricks?
- What are the individual contributions of experience replay and of a fixed target network?
We will answer these questions by investigating the return, i.e.
cumulative reward, over time achieved by the agent. We expect that
adding both tricks allows the method to converge on the chosen
environment, while only using either of the tricks won’t. Additionally,
we will look at the phenomenon of “soft divergence”, i.e.
unrealistically large Q-values. We hope to shed some light on the
effectiveness of these tricks in less complex environments than Atari
games, allowing beginners in the field of RL to get an intuition of why
DQN works so well, and to think of ways to improve it even
further.
This post assumes some familiarity with RL. If you feel a lack of prior
knowledge while reading the post, we recommend reading parts of the
classical introduction to Reinforcement Learning [3] or watching the
great lecture series (available online) by David Silver,
one of the creators of the famous AlphaZero.
Q-Functions and Q-Learning
In the explanations that follow, there is always the implicit notion of
a Markov Decision Process (MDP): it consists of a finite state space
$\mathcal{S}$, a finite action space $\mathcal{A}$, a reward space
$R \subseteq \mathbb{R}$, a discount factor $\gamma$ and transition
dynamics $p(s', r \mid s, a)$ that define the probabilities of arriving
in the next states with a certain reward, given the current state and
action.
The agent’s goal is to maximize the expected return $G$, which is the
cumulative discounted reward. This can be modeled by $Q$-functions,
which represent the expected return for each state-action pair:
$Q(s,a) = E\left[G \mid S = s, A = a \right] = E[R_1 + \gamma R_2 + \gamma^2 R_3 + \dots \mid S = s, A = a],$
where the rewards are stochastically determined by the actions according
to the agent’s policy (i.e. a distribution over actions
$\pi(a \mid s)$ for each state $s \in \mathcal{S}$) and the environment
dynamics $p(s', r \mid s, a)$. In particular, the $Q$-function tells us
which action in a given state leads to the highest expected return when
following the given behaviour afterwards.
Q-learning is the process of finding the $Q$-function of the optimal
policy while following a behaviour policy, e.g. an $\epsilon$-greedy one. This
can be done either in a tabular fashion or using function approximation:
Tabular Q-Learning
In tabular Q-learning, we use the following update rule:
$Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \left(r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right).$
Here, $s'$ is the next state and $\alpha$ is a hyperparameter, namely the learning rate. Under the assumption that each state-action pair $(s,a)$ is visited infinitely often, this algorithm always converges to the $Q$-function of the optimal policy [2].
Note that with growing dimension of the state space, the number of possible states grows exponentially. This is also called the curse of dimensionality, since it makes tabular $Q$-learning and similar tabular methods intractable in high-dimensional spaces. Therefore, we need ways to approximate this process.
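To make the update rule concrete, here is a minimal sketch of one tabular update in Python. The toy MDP (two states, two actions) and the hyperparameter values are our own illustrative choices, not taken from the experiments in this post:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.8):
    """One tabular Q-learning step: move Q[s, a] toward the bootstrapped target."""
    target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy example: 2 states, 2 actions, Q-table initialized to zero.
Q = np.zeros((2, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)
# Q[0, 1] moved from 0 toward the target 1.0 by a fraction alpha = 0.1
```

With a learning rate of $0.1$ and a zero-initialized table, the single update above lands exactly at $0.1$, one tenth of the way toward the target.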
Q-Learning with Function Approximation
In order to model these more complex scenarios, we model the Q-function as a parametrised differentiable function $Q(s, a; \theta)$ that takes in the state and action and, depending on the parameter values, outputs a specific value. In practice, this is often modelled as a function $Q(s; \theta)$ that only takes in the state and outputs a whole array of $Q$-values, one for each action. For this blog post, you can assume that $Q(s; \theta)$ is a neural network with $\theta$ as its weights. Then, what is learned are the parameters $\theta$, of which there are fewer than there are states in the state space. The update rule of semi-gradient Q-learning is then the following:
$\Delta\theta \propto \left(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\right) \nabla_{\theta} Q(s, a; \theta) \tag{1}$where $s$ is your last state, $a$ is the action you chose in that state, $r$ is the reward that followed it and $s'$ is the new state you found yourself in. It is a “semi-gradient method” because Equation (1) is not the full gradient of the loss $L = \left(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\right)^2.$ Instead, the target $r + \gamma \max_{a'} Q(s', a'; \theta)$ is treated as a constant.
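In PyTorch, the semi-gradient step of Equation (1) amounts to simply blocking gradients through the target. Here is a hedged sketch; the linear stand-in “network”, the dimensions and the transition values are placeholders for illustration:

```python
import torch
import torch.nn as nn

q_net = nn.Linear(4, 2)  # stand-in Q-network: 4 state dims in, one value per action out
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.8

# One transition (s, a, r, s'), here with dummy values.
s, s_next = torch.randn(1, 4), torch.randn(1, 4)
a, r = torch.tensor([0]), torch.tensor([1.0])

with torch.no_grad():  # the "semi" part: the target is treated as a constant
    target = r + gamma * q_net(s_next).max(dim=1).values

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
loss = (target - q_sa).pow(2).mean()  # squared TD error

optimizer.zero_grad()
loss.backward()   # gradients flow only through Q(s, a; theta), as in Equation (1)
optimizer.step()
```

Wrapping the target computation in `torch.no_grad()` is what makes this a semi-gradient rather than a full-gradient method.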
The Deadly Triad
The resulting method now has the following properties:
- The method is off-policy. That is, no matter what the behaviour policy is, due to the maximization in the update rule we directly learn the state-action values of the optimal policy. This allows the agent to explore all possible actions, while at the same time learning which actions are optimal.
- It is a bootstrapping method. That means that instead of updating its Q-value with an actual outcome of the behaviour, the agent uses an estimate $Q(s', a'; \theta)$ for the update, which is itself not perfect and might be biased.
- The method uses function approximation, as explained before.
The first two properties were already present in tabular $Q$-learning,
but together with the added function approximation the combination is called the
“deadly triad”, since it can lead to divergence of the algorithm^{1}.
In the following section, we explain the tricks that [1]
introduced to tackle the deadly triad and other convergence problems
that may arise in Deep RL. It is largely an open research question how
large the influence of the deadly triad is on divergence problems in
Deep RL, but specifically the fixed target network seems targeted for
precisely that, as described in [4]. Other problems come from the
nature of RL itself, with its highly non-i.i.d. data.
DQN and its Tricks
In principle, the method is just usual semi-gradient $Q$-learning with deep neural networks as function approximators. But the authors employ two important tricks:
1. Experience Replay
This describes the process of storing transitions $(s_t, a_t, r_t, s_{t+1})$ in a memory $D$ from which minibatches are later sampled during training.
That process has several advantages:
- It breaks correlations, so that the data is more i.i.d. This reduces the variance of the updates and increases stability, since correlated data might lead to overly large steps in one direction.
- Data is reused, which improves sample efficiency. Note that $Q$-learning is an off-policy method, so using data from the past, when the behaviour policy was still quite different, does not cause problems.
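A minimal replay memory can be sketched as follows; the capacity and the transition layout are illustrative choices, not necessarily those of the original code:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of transitions (s, a, r, s_next, done)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall out automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=10_000)
for t in range(100):
    memory.push(t, 0, 1.0, t + 1, False)  # dummy transitions
batch = memory.sample(64)  # minibatch used for one training step
```

The `deque` with `maxlen` gives the sliding-window behaviour of the memory $D$ for free: once full, each new transition evicts the oldest one.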
2. Fixed Target Network
Recall that the update in semi-gradient Q-learning is given by Equation (1), where $(s, a, r, s')$ is a saved transition and $\theta$ is the current parameter vector.
This framework can suffer from the following problem. First, some notation: denote the target by $T(\theta) = r + \gamma \max_{a'} Q(s', a'; \theta)$ and imagine it is larger by $1$ than $Q(s, a; \theta)$. Then we obtain $\Delta \theta \propto \nabla_{\theta} Q(s, a; \theta)$ and move in the direction of increasing $Q(s, a; \theta)$. But $Q$ is a function approximator, so this is not the only value that changes: similar state-action pairs will change their value as well. Now, $s'$ and $a'$ occur only one timestep after $s$ and $a$, so they may be perceived as very similar when processed by the $Q$-network. Consequently, $Q(s', a'; \theta)$ may increase as much as $Q(s, a; \theta)$. The result: $T(\theta^{\text{new}})$ may be one larger than $Q(s, a; \theta^{\text{new}})$ again!
You probably already see the problem: if we ran this indefinitely, the Q-values could explode. This happens due to the off-policy nature of the algorithm, which does not ensure that $Q(s', a'; \theta)$ gets the relevant feedback later on that would make it smaller again.
For this problem, it may help to fix the target network, which was proposed in [1] for the first time. In that framework, you have parameters $\theta$ which you update constantly, and parameters $\theta^{-}$ which you hold constant for $C$ timesteps, until you update them to $\theta$ and hold them constant again. This is claimed to prevent the spiraling that we describe here.
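Schematically, the fixed target network amounts to keeping a frozen copy $\theta^{-}$ that is synchronised with $\theta$ every $C$ steps. A hedged sketch follows; the in-place weight bump merely stands in for a real gradient update:

```python
import copy
import torch
import torch.nn as nn

policy_net = nn.Linear(4, 2)            # theta: updated at every step
target_net = copy.deepcopy(policy_net)  # theta^-: held fixed for C steps
C = 50

for step in range(1, 101):
    with torch.no_grad():
        policy_net.weight.add_(0.01)    # stand-in for a real gradient update of theta
    # Targets would be computed with target_net(s_next) here, not policy_net.
    if step % C == 0:
        # Every C steps, copy theta into theta^- and freeze it again.
        target_net.load_state_dict(policy_net.state_dict())
```

Between synchronisations the targets are computed against the stale $\theta^{-}$, so a step that increases $Q(s, a; \theta)$ no longer immediately drags the target up with it.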
The Experiment
Aim of the Experiments
The experiments we run aim to visualize the impact of the tricks used in DQN. Note that in [1], the authors find that each of the two tricks, experience replay and a fixed target network, has its own contribution to the success of the agent and that their contributions amplify when combined. We are interested in whether we can confirm these results in a simpler environment, and in highlighting the individual contributions of each trick.
The self-written source code of the experiments can be found on GitHub^{2}, where we credit others whose work inspired ours.
The Environment
The desiderata for our environment were the following:
- It is simple enough to allow easy experimentation.
- It is complex and difficult enough that DQN without experience replay and a fixed target network is not able to solve it.
With these criteria in mind, we will see if the tricks proposed in DQN make it solvable.
We qualitatively reviewed the performance of our model trained on all classic control environments of OpenAI Gym^{3}. Performance is measured as the (negative) episode duration (the sign depending on whether the agent is tasked to have long or short episodes), plotted over the training period. We found that the Acrobot-v1 and MountainCar-v0 environments would require further modifications to DQN to be solvable, i.e. even the tricks were not able to lead to convergence. However, the CartPole-v1 environment displayed exactly the required properties: it diverged when using standard semi-gradient Q-learning with function approximation, but converged when adding experience replay and a target network. In this environment the return is the total number of steps that the pole stays upright, with an upper limit of $500$ steps per episode.
Methods
We have trained the network in the following four variants:
1. vanilla semi-gradient Q-learning with function approximation;
2. the same as (1), plus experience replay;
3. the same as (1), plus a fixed target network which is updated every $C$ steps, where we experiment with values of $C$ in the set $\{1, 50, 100, 200, 500\}$;
4. a combination of (2) and (3). This is the full training process proposed in [1].
Each variant uses the same simple neural network implemented in PyTorch:
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    def __init__(self, num_s, num_a, num_hidden=128):
        super().__init__()
        # One hidden layer: state observation in, one Q-value per action out.
        self.l1 = nn.Linear(num_s, num_hidden)
        self.l2 = nn.Linear(num_hidden, num_a)

    def forward(self, x):
        x = F.relu(self.l1(x))
        return self.l2(x)
All variants use the same static set of hyperparameters listed in the table below, where we also explain the reasoning behind their choice.
| Hyperparameter | Value | Rationale |
| --- | --- | --- |
| Number of episodes | $100$ | The variants that do converge do so after around $50$ episodes. |
| Batch size | $64$ | Standard power-of-two value with no particular meaning. |
| Discount factor | $0.8$ | Hand-picked during the experimentation and implementation process. |
| Learning rate | $10^{-3}$ | Hand-picked during the experimentation and implementation process. |
| Memory capacity | $10{,}000$ | Hand-picked during the experimentation and implementation process. |
| Replay start size | Batch size | After $64$ state transitions we begin training, since the memory then contains enough transitions to fill a batch. |
| Seed | $0$ to $10$ | A set of different seeds, used to ensure reproducibility and to get uncorrelated results. |
| $\epsilon_{\text{start}}$ | $1$ | At the beginning we play a maximal-exploration strategy, since the policy has not found good behaviour yet. |
| $\epsilon_{\text{end}}$ | $0.05$ | We don’t act fully greedily even during validation, so that the agent cannot overfit to the training experience. |
| $\epsilon_{\text{steps}}$ | $1000$ | Within $1000$ updates, $\epsilon$ is linearly annealed to $0.05$. Hand-picked during implementation, as it gave fast convergence. |
| Optimizer | Adam (default parameters) | Widely used optimizer suitable for most applications. |
| NN architecture | Input dim: number of state dimensions, i.e. $4$; hidden dim: $128$, followed by ReLU; output dim: number of actions, i.e. $2$ | Hand-picked based on problem and input complexity. |
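The linear annealing of $\epsilon$ described in the table can be sketched as a small helper; the function name and signature are our own:

```python
def epsilon_at(step, eps_start=1.0, eps_end=0.05, anneal_steps=1000):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_steps updates."""
    frac = min(step / anneal_steps, 1.0)   # clamp so epsilon stays at eps_end afterwards
    return eps_start + frac * (eps_end - eps_start)

# epsilon_at(0) returns 1.0; epsilon_at(1000) returns 0.05; constant afterwards.
```

Clamping the fraction at $1$ keeps $\epsilon$ pinned at its final value of $0.05$ for the rest of training and during validation.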
The Implementation
We used the code obtained during the lab sessions of the Reinforcement Learning course at the University of Amsterdam as the skeleton of our code, added a flag to enable/disable the experience replay trick, and added an integer input that defines for how many steps the target network is held fixed ($1$ meaning it is updated at every step, i.e. it is not fixed). All the environments used are from the OpenAI Gym library. For the implementation we use several open-source standard machine learning Python packages (for details see the source code). The experiments are run on a regular laptop with an Intel i5-7200U CPU and 8GB of RAM, on which a single run of 100 episodes takes about 30 seconds. This means that the results we obtained here can easily be checked on your computer in a few minutes.
Results
For assessing the performance of our agent, we look both at the return and at so-called soft divergence. Note that with return, we always mean the undiscounted return, even though the agent discounts, in our case with $\gamma = 0.8$. That is, we treat $\gamma$ not as part of the objective, but as a hyperparameter. We will explain soft divergence in the corresponding subsection.
Return
We will present the results of the experiments with plots in which the x-axis represents the number of episodes and the y-axis represents the return obtained during validation. This means that every 5 episodes we take the trained model, fix $\epsilon$ to 0.05 and evaluate the performance of this model, without training it during this process.
Each curve belongs to one variant of the algorithm. For each variant, we train the agent from scratch 10 times with different seeds and create the line by averaging the returns. The shaded area around each curve is the $95\%$ confidence interval (i.e. it represents $\mu \pm 2\sigma$).
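The curves and shaded bands can be computed from the per-seed return curves roughly like this; the array of returns is synthetic, just to show the shapes involved:

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.uniform(0, 500, size=(10, 20))  # synthetic: 10 seeds x 20 eval points

mean = returns.mean(axis=0)                    # curve: average return per eval point
std = returns.std(axis=0)
lower, upper = mean - 2 * std, mean + 2 * std  # shaded band around the curve
```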
Figure 1 presents the comparison between the models with and without experience replay. In both cases, we did not use the fixed target network. The picture clearly displays the difference in quality of the two different methods showing that experience replay is a very important factor of convergence in this environment. We also ran the method without experience replay for 500 episodes, but still did not see any clearly improving values as seen in Figure 6 in the appendix.
In Figure 2 we display the effect of the target network with different update rates on the obtained returns, with experience replay turned off. We observe that the version with an update rate of $50$ is often in the lead; however, we do not see an obvious statistical effect of the target network alone. To keep the graphs uncluttered, we do not show the update rates of $100$ and $500$, which perform roughly on par with the update rate of $50$.
In Figure 3 we illustrate the effect of the combination of both tricks. In the case with experience replay, over large parts of the training process the version with an update rate of $200$ steps is leading, although no statistical significance can be inferred in comparison to the update rates of $50$ and $1$ (no target network).
Finally, in Figures 1 and 3, we observe that the average episode duration decreases after 75 episodes. To investigate whether this is a systematic or purely random effect, we trained the model for 300 episodes, which can be seen in Figure 7 in the appendix. In both settings the performance seems to decrease slightly after 100 episodes. To be on the safe side, we therefore suggest periodically taking snapshots of the weights and, in the end, using the model with the highest average episode duration.
Soft Divergence
Now we turn to assessing the so-called soft divergence. Soft divergence was first studied in [4], as follows:
Let $\gamma$ be our discount rate, in our case $0.8$. Since our rewards are all in the interval $[-1, 1]$ (actually, the reward is always $1$ until the agent loses or the episode is over), there is a bound on the maximally achievable true Q-values. We can see this by bounding the discounted return:
$G = \sum_{n = 0}^{\infty} \gamma^n R_{n+1} \leq \sum_{n = 0}^{\infty} \gamma^n = \frac{1}{1 - \gamma}.$
In our case, this bound is $\frac{1}{1 - 0.8} = 5$. Now, soft divergence is the phenomenon of learned Q-values that exceed this bound. This is connected to the deadly triad: off-policy learning with bootstrapping and function approximation can lead to a situation in which the Q-values chase the targets, which themselves can get bigger and bigger until they exceed realistic bounds.
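The bound is easy to sanity-check numerically: a finite-horizon sum of discounted unit rewards approaches $5$ from below.

```python
gamma = 0.8
bound = 1.0 / (1.0 - gamma)                    # geometric-series limit: 5.0
partial = sum(gamma ** n for n in range(100))  # return of 100 steps of reward 1
# partial sits just below the bound of 5
```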
In order to assess the presence of soft divergence, we measure the maximal Q-values that appear within evaluation episodes. In Figure 4, you can find this for the case without experience replay, and in Figure 5 for the case with experience replay.
First of all, we can see that if we use experience replay, the maximal Q-value converges precisely to $5$. This is expected: we already know that an agent using experience replay achieves a high return, and so if the Q-value measures the return correctly, it should output something close to the optimal value, which is $5$.
However, the version without experience replay actually shows soft divergence, with Q-values that are consistently too large at the end of training, ranging between $5$ and $8$. Additionally, since we already know that the agent performs very badly, we would actually expect Q-values lower than $5$. This gives us an explanation for why the agent is not able to perform properly: the estimated Q-values are partly too high during the entire training process, and therefore not accurate, and cannot guide the agent to the best actions.
Another interesting phenomenon can be observed by looking at the differences between the target network update rates. In both plots we see that, at the beginning of the training process, the Q-values get considerably larger than $5$, ranging up to $25$. That is, all versions of the algorithm show soft divergence in the first 50 episodes, with a peak after $20$ episodes. One explanation might be the maximization bias present in Q-learning: the algorithm overestimates the Q-values as a consequence of the $\max$ operator in the update rule. In practice, this bias has been studied and can be addressed, for example, with double Q-learning, explained in [5]. However, since we did not include experiments using double Q-learning, we cannot assess the validity of this explanation.
A second and complementary explanation might be the following: at the beginning of training, the learned representations of the Q-network are poor and might not succeed in distinguishing between different state-action pairs. If that happens, then increasing one Q-value might increase the Q-values of all other states in the same way, which creates the already explained phenomenon that Q-values can become unbounded. This would also explain why the problem is less severe when using the fixed target network, and even less severe when the updates of its parameters happen less often.
Conclusion
In this blog post we have analyzed what effect the tricks experience replay and a fixed target network have on the convergence of the DQN algorithm. Our experiments have shown that, by itself, the DQN algorithm without experience replay and fixed target network fails to converge.
Furthermore, we observe that experience replay has a significant impact on the objective, i.e. return, whereas the fixed target network showed almost no effect on that. This is different from what we expected, as we initially expected that only the addition of both tricks would allow the agent to perform optimally.
We believe the large effect of experience replay on the return arises because in a physical system like the cart-pole environment, subsequent state observations are deterministically dependent. This is the strongest form of correlation and might lead to overly large steps in one direction. This can be especially crucial because a fast move of the cart in one direction can lead to a change of orientation of the pole in the other direction and to the end of the episode.
The small effect of the fixed target network on the return is qualified, however, by analyzing its effect on soft divergence. We saw that the fixed target network actually plays an essential role in mitigating the initially very high Q-values seen in all variants of the algorithm. This does not itself change the return of the agent after convergence, since in all cases the algorithm is able to correct for the mistake of vastly too large initial Q-values. Still, these results hint that in more complex environments we would also see an influence of the fixed target network on the return achieved by agents, as updating the target network more rarely can help fight soft divergence.
We conclude that DQN generally contains strong and theoretically sound ideas, and we saw that both tricks can, in different ways, help avoid divergence. However, training it successfully and making fruitful use of the tricks still requires practice and effortful tuning.
Appendix
DQN without Experience Replay after 500 Episodes
DQN with Experience Replay after 300 Episodes
References
[1] Human-level control through deep reinforcement learning
[2] Playing Atari with Deep Reinforcement Learning
[3] Sutton & Barto: Reinforcement Learning Book
[4] Deep Reinforcement Learning and the Deadly Triad
[5] Double Q-learning

It is out of the scope of this blog post to go into the details as to why the deadly triad can lead to divergence. We recommend the interested reader to read chapter $11$ of [3].
https://github.com/YoniSchirris/rllab3