After reviewing similar posts on this topic, I understand that a target network is used to prevent "divergence", but I am not sure what that actually means. Q-values are predicted using a function approximator. The weights of the function approximator are then updated using the difference between the Q-value and the TD target ($r + \gamma\max_a Q(s',a)$). Now, assuming that the estimate of the Q-value was wrong, the weights could easily be updated so that they are correct the next time the state is encountered. My confusion arises when online blog posts say that this change in weights modifies the target Q-value ($Q(s', a)$) too. I don't see how the target Q-value gets updated when the current Q-value is changed.
1 Answer
I don't see how the target Q-value gets updated when the current Q-value is changed.
Without a separate target network, this happens because the approximator will generalise, and that generalisation will include the successor/target states. This is very likely in many environments, since successor states often share many features with the states that precede them (think how similar any video game looks a few frames on from any position).
In tabular variants of Q-learning this does not happen: a change to any single Q value, for a given state and action, is always made to a single estimate that is isolated from all other estimates. Adding approximation changes things, and this is not avoidable - in fact it is usually desirable to have strong generalisation in order to obtain good estimates for never-seen-before states. However, the flip side of this strong generalisation is that updating the estimate for $Q(s,a)$ will impact many other estimates $Q(s_i,a_j)$, including the ones used as targets.
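To make that effect concrete, here is a minimal sketch (pure NumPy, with made-up feature vectors and hyperparameters, a linear approximator standing in for the neural network): a single semi-gradient update aimed at $Q(s,a)$ also shifts the estimates $Q(s',\cdot)$, simply because $s$ and $s'$ share features.

```python
# Illustrative only: a linear function approximator over state features.
import numpy as np

np.random.seed(0)

n_features, n_actions = 4, 2
W = np.random.randn(n_actions, n_features) * 0.1  # weights of the approximator

def q_values(state):
    """Q(s, .) for all actions under the current weights."""
    return W @ state

# Two successive states that share most of their features
# (e.g. two nearly identical video frames).
s      = np.array([1.0, 0.5, 0.2, 0.0])
s_next = np.array([1.0, 0.5, 0.3, 0.1])

action, reward, gamma, alpha = 0, 1.0, 0.99, 0.5

print("Q(s', .) before update:", q_values(s_next))

# Semi-gradient Q-learning step on Q(s, a): the TD target is treated as a constant.
td_target = reward + gamma * np.max(q_values(s_next))
td_error  = td_target - q_values(s)[action]
W[action] += alpha * td_error * s  # gradient of Q(s, a) w.r.t. W[action] is s

print("Q(s', .) after update: ", q_values(s_next))
# The target estimates Q(s', .) have shifted even though the update only
# "meant" to correct Q(s, a) - this is the generalisation described above.
```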
One related thing worth bearing in mind is that the Q-function update in DQN is a semi-gradient update. If you do not use a target network, then technically the full gradient needs to take account of how the TD target changes when the weights change (because both $Q(s,a)$ and $Q(s',a')$ are calculated using the same approximator). So one way to try to solve the same issue is to alter the update to use the full gradient. The maths for this is more complex than standard DQN, but it is addressed in the Sutton & Barto book in chapter 11, section 11.7, "Gradient-TD Methods".
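To show where the "semi" part appears in code, here is a hedged PyTorch sketch (the network size, learning rate and single transition are all invented for the example, and this is not the book's Gradient-TD algorithm): the TD target is computed under `torch.no_grad()`, so no gradient flows back through $Q(s',a')$ even though it comes from the same network.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

# A hypothetical single transition (s, a, r, s'), just for shapes.
s      = torch.randn(1, 4)
a      = torch.tensor([[0]])
r      = torch.tensor([[1.0]])
s_next = torch.randn(1, 4)

q_sa = q_net(s).gather(1, a)                    # Q(s, a)
with torch.no_grad():                           # <- the "semi" part:
    td_target = r + gamma * q_net(s_next).max(1, keepdim=True).values

loss = nn.functional.mse_loss(q_sa, td_target)
optimizer.zero_grad()
loss.backward()   # gradients flow only through Q(s, a), not through the target
optimizer.step()
# Removing the no_grad() block would give the naive "residual gradient" update,
# which is related to, but not the same as, the Gradient-TD methods of chapter 11.
```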
In practice for DQN, most experiments seem to prefer some variant of a target network. This also effectively turns the learning task into a full-gradient one, because the TD targets are held fixed with respect to the online weights and only progress when the target network is refreshed. Although it is not as theoretically nice as a full gradient method, it seems to be a pragmatic choice made by many researchers and developers.
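Here is a sketch of that usual target-network pattern (hyperparameters such as `copy_every` are hypothetical, and the minibatch/loss step is elided): the TD targets come from a frozen copy of the Q-network that is refreshed only periodically, so between refreshes they do not move when the online weights are updated.

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = copy.deepcopy(q_net)
target_net.requires_grad_(False)        # never trained directly

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, copy_every = 0.99, 1000

def td_targets(r, s_next):
    # Targets come from the frozen copy, not the online network.
    with torch.no_grad():
        return r + gamma * target_net(s_next).max(1, keepdim=True).values

for step in range(10_000):
    # ... sample a minibatch (s, a, r, s') from the replay buffer, compute the
    # loss of q_net(s) against td_targets(r, s_next) and take an optimizer
    # step on q_net, as in the previous sketch ...
    if step % copy_every == 0:
        target_net.load_state_dict(q_net.state_dict())  # periodic hard update
```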
The paper Full Gradient DQN Reinforcement Learning: A Provably Convergent Scheme attempts to make a fair comparison of DQN with a TD-gradient DQN, and comes to the conclusion that the full gradient approach is better - however, there may be other reasons to prefer the slightly clunkier semi-gradient plus target network approach.
