I am reading the D3QN paper, and it contains the following paragraph:
> Equation (7) is unidentifiable in the sense that given $Q$ we cannot recover $V$ and $A$ uniquely. To see this, add a constant to $V(s; \theta, \beta)$ and subtract the same constant from $A(s, a; \theta, \alpha)$. This constant cancels out resulting in the same $Q$ value. This lack of identifiability is mirrored by poor practical performance when this equation is used directly.
>
> To address this issue of identifiability, we can force the advantage function estimator to have zero advantage at the chosen action.
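If I understand correctly, the paper then replaces the naive sum $Q = V + A$ with one of two aggregations (its Equations (8) and (9), as I recall them, so worth double-checking against the paper), subtracting either the max or the mean of the advantages:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big(A(s, a; \theta, \alpha) - \max_{a'} A(s, a'; \theta, \alpha)\Big)$$

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \alpha)\Big)$$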
How does forcing the advantage of the chosen action to zero (or, since the mean rather than the max operator is used in practice, to something slightly above zero) help the neural network learn accurate estimates of $V$ and $A$?
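For concreteness, here is how I picture the mean-based aggregation being wired into a dueling head (a minimal PyTorch-style sketch; `DuelingHead`, the layer shapes, and the single linear layer per stream are my own placeholders, not from the paper):

```python
# Minimal sketch of a dueling head with mean-subtraction aggregation (hypothetical names/sizes).
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feature_dim: int, num_actions: int):
        super().__init__()
        # Two separate streams on top of shared features.
        self.value = nn.Linear(feature_dim, 1)                 # V(s; theta, beta)
        self.advantage = nn.Linear(feature_dim, num_actions)   # A(s, a; theta, alpha)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)       # shape: (batch, 1)
        a = self.advantage(features)   # shape: (batch, num_actions)
        # Mean-subtraction aggregation: the advantages are shifted to have zero mean,
        # which pins down the V/A split up to that convention.
        # Max variant would be: v + a - a.max(dim=1, keepdim=True).values
        return v + a - a.mean(dim=1, keepdim=True)
```

Given shared features from the convolutional trunk, `DuelingHead(feature_dim, num_actions)(features)` returns the $Q$-values; since the subtracted term is constant across actions, the greedy action is unchanged by the aggregation.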