What is the idea behind double DQN?
The target in double DQN is computed as follows
$$ Y_{t}^{\text{DoubleQ}} \equiv R_{t+1}+\gamma Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}}\, Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}\right) ; \boldsymbol{\theta}_{t}^{\prime}\right), $$ where
- $\boldsymbol{\theta}_{t}^{\prime}$ are the weights of the target network
- $\boldsymbol{\theta}_{t}$ are the weights of the online value network
- $\gamma$ is the discount factor
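For concreteness, here is how I picture the Double DQN target being computed in code. This is just a rough sketch, assuming PyTorch, a toy environment, and placeholder names (`online_net`, `target_net`, a dummy batch); terminal-state handling is left out for brevity.

```python
import torch
import torch.nn as nn

# Toy setup: 4-dimensional states, 2 actions. target_net is a periodically
# copied snapshot of online_net (the theta'_t in the formula above).
online_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net.load_state_dict(online_net.state_dict())

gamma = 0.99
rewards = torch.zeros(8)         # R_{t+1} for a dummy batch of 8 transitions
next_states = torch.randn(8, 4)  # S_{t+1}

with torch.no_grad():
    # Select the action with the ONLINE network: argmax_a Q(S_{t+1}, a; theta_t)
    best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
    # Evaluate that action with the TARGET network: Q(S_{t+1}, best_action; theta'_t)
    double_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    y_double = rewards + gamma * double_q
```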
On the other hand, the target in DQN is computed as
$$Y_{t}^{\mathrm{DQN}} \equiv R_{t+1}+\gamma \max _{a} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}^{-}\right),$$ where $\boldsymbol{\theta}_{t}^{-}$ are the weights of the target network.
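For comparison, the standard DQN target in the same toy setup (again just a sketch, reusing `rewards`, `next_states`, and `target_net` from the snippet above) would be:

```python
with torch.no_grad():
    # Both action selection and evaluation use the TARGET network:
    # max_a Q(S_{t+1}, a; theta^-_t)
    y_dqn = rewards + gamma * target_net(next_states).max(dim=1).values
```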
As I understand it, the target network used to evaluate the action gets its weights by periodically copying the online network's weights, so the value fed into the target is essentially an older Q-value estimate of the selected action.
Can anyone explain how or why adding a second network, whose weights are just older copies of the first network's weights, helps? An example would be appreciated.