
I was going through this implementation of DQN and I see that two different Q networks have been initialized on lines 124 and 125. From my understanding, one network predicts the appropriate action and the second network predicts the target Q values used to find the Bellman error.

Why can we not just make one single network that simply predicts the Q value and use it in both cases? My best guess is that it's been done to reduce the computation time; otherwise we would have to find out the Q value for each action and then select the best one. Is this the only reason? Am I missing something?
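A minimal sketch of the pattern being described (the network and variable names are illustrative, not taken from the linked implementation):

```python
import torch.nn as nn

def make_q_network(state_dim, n_actions):
    # A small fully connected Q-network; the architecture is illustrative only
    return nn.Sequential(
        nn.Linear(state_dim, 64),
        nn.ReLU(),
        nn.Linear(64, n_actions),
    )

q_net = make_q_network(state_dim=4, n_actions=2)       # online network, trained every step
target_net = make_q_network(state_dim=4, n_actions=2)  # target network, only refreshed periodically
target_net.load_state_dict(q_net.state_dict())         # start both from the same weights
```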


1 Answer


> My best guess is that it's been done to reduce the computation time; otherwise we would have to find out the Q value for each action and then select the best one.

It has no real impact on computation time, other than a slight increase (due to extra memory used by two networks). You could cache results of the target network I suppose, but it probably would not be worth it for most environments, and I have not seen an implementation which does that.

> Am I missing something?

It has to do with the stability of the Q-learning algorithm when using function approximation (i.e. the neural network). Using a separate target network, updated every so many steps with a copy of the latest learned parameters, helps prevent the runaway bias introduced by bootstrapping from dominating the system numerically and causing the estimated Q values to diverge.

Imagine one of the data points (at $s, a, r, s'$) causes a currently poor over-estimate of $q(s', a')$ to get worse. Maybe $s', a'$ has not even been visited yet, or the values of $r$ seen so far are higher than average, just by chance. If a sample of $(s, a)$ cropped up multiple times in experience replay, it would get worse again each time, because the update to $q(s,a)$ is based on the TD target $r + \max_{a'} q(s',a')$. Fixing the target network limits the damage that such over-estimates can do, giving the learning network time to converge and lose more of its initial bias.
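A minimal PyTorch sketch of how this typically looks in practice (the names and values here are illustrative assumptions, not from the linked implementation): the TD target is computed with the frozen target network under `torch.no_grad()`, and the target network only receives a copy of the learned weights every so many steps.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99
TARGET_UPDATE_EVERY = 1000  # illustrative hard-update period

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones):
    # Q(s, a) from the learned (online) network, for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target r + gamma * max_a' Q_target(s', a'), computed with the frozen
    # target network and treated as a constant (no gradient flows through it)
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * (1.0 - dones) * max_next_q

    return F.smooth_l1_loss(q_sa, targets)

# Inside the training loop, every TARGET_UPDATE_EVERY steps:
#     target_net.load_state_dict(q_net.state_dict())
```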

In this respect, using a separate target network has a very similar purpose to experience replay. It stabilises an algorithm that otherwise has problems converging.

It is also possible to have DQN with "double learning" to address a separate issue: Maximisation bias. In that case you may see DQN implementations with 4 neural networks.
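For completeness, a sketch of how the Double DQN target differs from the standard one above (again illustrative, not from any particular implementation): the online network selects the greedy next action, and the target network evaluates it.

```python
import torch

def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # The online network selects the greedy next action ...
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ... and the target network evaluates it, which reduces maximisation bias
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q
```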

Neil Slater
For additional reading, one can refer to the paper that introduced the Double DQN: [Deep Reinforcement Learning with Double Q-learning](http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/12389/11847) (2016, AAAI-16) by Hado van Hasselt et al. – amitection Aug 06 '18 at 17:00