What is the idea behind double DQN?

The target in double DQN is computed as follows

$$ Y_{t}^{\text {DoubleQ }} \equiv R_{t+1}+\gamma Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}\right) ; \boldsymbol{\theta}_{t}^{\prime}\right), $$ where

  • $\boldsymbol{\theta}_{t}^{\prime}$ are the weights of the target network
  • $\boldsymbol{\theta}_{t}$ are the weights of the online value network
  • $\gamma$ is the discount factor

On the other hand, the target in DQN is computed as

$$Y_{t}^{\mathrm{DQN}} \equiv R_{t+1}+\gamma \max _{a} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}^{-}\right),$$ where $\boldsymbol{\theta}_{t}^{-}$ are the weights of the target network.

In Double DQN, the online network is used to select the greedy action (the argmax), while the target network, whose weights are a periodically updated copy of the online network's weights, is used to evaluate that action. The value plugged into the target is therefore essentially an older estimate of the Q-value of the selected action (see the sketch below).
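To make the difference concrete, here is a minimal NumPy sketch (the array names and shapes are made up for illustration, and the Q-values are random stand-ins for network outputs) that computes both targets for a small batch of transitions. The only change in Double DQN is that the argmax comes from the online network's estimates while the value still comes from the target network:

```python
import numpy as np

# Hypothetical batch: q_online and q_target stand in for the Q-values that
# the online and target networks would output for the next states S_{t+1},
# one row per transition, one column per action.
rng = np.random.default_rng(0)
batch_size, n_actions = 4, 3
q_online = rng.normal(size=(batch_size, n_actions))  # Q(S_{t+1}, ., theta_t)
q_target = rng.normal(size=(batch_size, n_actions))  # Q(S_{t+1}, ., theta_t')
rewards = rng.normal(size=batch_size)                # R_{t+1}
gamma = 0.99

# Standard DQN target: the target network both selects and evaluates the
# greedy action (a single max over its own estimates).
y_dqn = rewards + gamma * q_target.max(axis=1)

# Double DQN target: the online network selects the action (argmax),
# the target network evaluates that selected action.
greedy_actions = q_online.argmax(axis=1)
y_double = rewards + gamma * q_target[np.arange(batch_size), greedy_actions]

print(y_dqn)
print(y_double)
```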

Any ideas on how or why adding another network based on weights from the first network helps? Any example?


1 Answer

As the authors of this paper state it:

In $Q$-learning, the agent updates the value of executing an action in the current state, using the values of executing actions in a successive state. This procedure often results in an instability because the values change simultaneously on both sides of the update equation. A target network is a copy of the estimated value function that is held fixed to serve as a stable target for some number of steps.

If I remember correctly, the main concern is that the network could end up in a positive feedback loop, making sufficient exploration of different state-action combinations less likely, which would be detrimental to learning.
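A minimal sketch of the "held fixed for some number of steps" idea described in the quote might look like the following; the names (OnlineNet, sync_every, the training loop) are hypothetical placeholders, and the actual gradient update is elided:

```python
import copy

class OnlineNet:
    """Stand-in for the online value network; only holds its parameters."""
    def __init__(self):
        self.weights = {"w": 0.0}

online = OnlineNet()
target = copy.deepcopy(online)  # theta' starts as a copy of theta
sync_every = 1000               # how long the target copy stays frozen

for step in range(1, 5001):
    # ... compute targets using `target`, update `online` by gradient descent ...
    if step % sync_every == 0:
        # Refresh the frozen copy so the targets track the improving estimates,
        # but only occasionally, which keeps the regression target stable
        # instead of moving at every single update.
        target = copy.deepcopy(online)
```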
