
I have difficulty understanding the following paragraph in the excerpts below, taken from pages 4 to 5 of the paper Dueling Network Architectures for Deep Reinforcement Learning.

The author said "we can force the advantage function estimator to have zero advantage at the chosen action."

For the equation $(8)$ below, is it correct that $A - \max A$ is at most zero?

... lack of identifiability is mirrored by poor practical performance when this equation is used directly.

To address this issue of identifiability, we can force the advantage function estimator to have zero advantage at the chosen action. That is, we let the last module of the network implement the forward mapping

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \max_{a' \in | \mathcal{A} |} A(s, a'; \theta, \alpha) \right). \tag{8}$$

Now, for $a^* = \arg\max_{a' \in \mathcal{A}} Q(s, a'; \theta, \alpha, \beta) = \arg\max_{a' \in \mathcal{A}} A(s, a'; \theta, \alpha)$, we obtain $Q(s, a^*; \theta, \alpha, \beta) = V(s; \theta, \beta)$. Hence, the stream $V(s; \theta, \beta)$ provides an estimate of the value function, while the other stream produces an estimate of the advantage function.
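To check my reading of Equation 8, I wrote a small NumPy sketch with made-up numbers (the values of $V$ and $A$ below are purely illustrative, not from the paper):

```python
import numpy as np

# Toy numbers for a single state with 4 actions (illustrative only).
V = 2.0                               # value-stream output V(s)
A = np.array([0.5, -1.0, 1.5, 0.2])   # advantage-stream outputs A(s, a)

# Equation (8): subtract the maximum advantage before adding V.
Q = V + (A - A.max())

print(A - A.max())     # [-1.  -2.5  0.  -1.3]  -> every entry is <= 0
a_star = Q.argmax()
print(Q[a_star] == V)  # True: Q(s, a*) equals V(s)
```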

I would also like to request further explanation of Equation 9, specifically the part of the author's explanation quoted below (the passage I had marked with red parentheses).

An alternative module replaces the max operator with an average:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} A(s, a'; \theta, \alpha) \right). \tag{9}$$

On the one hand this loses the original semantics of $V$ and $A$ because they are now off-target by a constant, but on the other hand it increases the stability of the optimization: with (9) the advantages only need to change as fast as the mean, instead of having to compensate any change to the optimal action’s advantage in (8).
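Here is the same toy sketch adapted to Equation 9 (again, my own made-up numbers); it shows that the value stream is now off from $Q(s, a^*)$ by a constant:

```python
import numpy as np

V = 2.0
A = np.array([0.5, -1.0, 1.5, 0.2])

# Equation (9): subtract the mean advantage instead of the max.
Q = V + (A - A.mean())

a_star = Q.argmax()
print(Q[a_star] - V)                                  # ~1.2: no longer zero at the best action
print(np.isclose(Q[a_star] - V, A.max() - A.mean()))  # True: the offset is max(A) - mean(A), a constant
```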

In the paper, two equations are used to address the identifiability issue. My understanding is that both equations try to constrain the advantage part, i.e. the last module.

For equation $(8)$, are we trying to make $V(s) = Q(s, a^*)$, since the advantage term in the last module is zero at the chosen (best) action?

For equation $(9)$, is the resulting $V(s)$ equal to the true $V(s)$ plus mean$(A)$? The author said "On the one hand this loses the original semantics of $V$ and $A$ because they are now off-target by a constant". Does the constant refer to mean$(A)$? Is my understanding correct?

Cheng
  • Please, next time, ask only one question per post. Even if your questions are related, you should ask each in its separate post, so that people can focus on one question at a time and future users/readers can find the answer to their specific question more quickly. – nbro Jan 25 '23 at 22:08

2 Answers


Yes, you're correct: if Equation 8 is used, it will only be possible to get values $\leq 0$ out of the term

$$\left( A(s, a; \theta, \alpha) - \max_{a' \in \vert \mathcal{A} \vert} A(s, a'; \theta, \alpha) \right).$$

This matches the meaning that we intuitively assign to the $Q(s, a)$, $V(s)$, and $A(s, a)$ estimators (I'm leaving the parameters $\theta$, $\alpha$, and $\beta$ out of those parentheses for the sake of notational brevity). Intuitively, we want:

  • $Q(s, a)$ to estimate the value of being in state $s$ and executing action $a$ for the policy that we are learning about.
  • $V(s)$ to estimate the value of being in state $s$ for the policy that we are learning about.
  • $A(s, a)$ to estimate the advantage of executing action $a$ in state $s$ for the policy that we are learning about.

In the above three points, "the policy that we are learning about" is the greedy policy, the "optimal" policy given what we have learned so far (ideally this would be truly the optimal policy after a long period of training).

In the last of the three points above, the advantage can intuitively be understood as the gain in estimated value from choosing action $a$, relative to the expected value we would get by following the policy that we are learning about.

Since we are trying to learn about the greedy policy, we'll ideally (according to our intuition) want the maximum advantage $A(s, a)$ to be equal to $0$; intuitively, the best action is precisely the one we want to execute in our greedy policy, so that best action should not have any relative "advantage". Similarly, all non-optimal actions should have a negative advantage, because they are estimated to be worse than what we estimate to be the optimal action(s).
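In equation form, using the definition $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$ with $\pi$ the greedy policy (so that $V^{\pi}(s) = \max_{a} Q^{\pi}(s, a)$):

$$\max_{a} A^{\pi}(s, a) = \max_{a} Q^{\pi}(s, a) - V^{\pi}(s) = 0.$$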

This intuition is mathematically enforced by using Equation 8 from the paper for training:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \max_{a' \in \vert \mathcal{A} \vert} A(s, a'; \theta, \alpha) \right).$$

We can consider two cases to explain what this is doing:

  1. Suppose that action $a$ is the best action we could have selected in state $s$ according to our current estimates, i.e. $a = \arg \max_{a' \in \vert \mathcal{A} \vert} A(s, a'; \theta, \alpha)$. Then, the two terms in the large brackets are equal to each other, so the subtraction yields $0$, and the state-action value estimate $Q(s, a)$ equals the state value estimate $V(s)$. This is exactly what we want because we are trying to learn about the greedy policy.

  2. Suppose that action $a$ is worse than the best action we could have selected in state $s$ according to our current estimates, i.e. $A(s, a; \theta, \alpha) < \max_{a' \in \vert \mathcal{A} \vert} A(s, a'; \theta, \alpha)$. By assumption, the first term of the subtraction is less than the second term, so the subtraction yields a negative number. This means that the state-action value estimate $Q(s, a)$ becomes less than the estimated state value $V(s)$. This is also what we want intuitively: we started from the assumption that action $a$ is suboptimal, and assuming that $a$ is suboptimal should lead to a reduction in the estimated value.


Note that afterwards, when they start explaining Equation 9, they actually intentionally deviate from these standard, intuitive understandings that we have of what the three estimators should represent.
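To make the forward mapping concrete, here is a minimal sketch of how such a dueling head could be implemented. This is my own illustrative PyTorch code with made-up layer sizes (`state_dim`, `hidden_dim`, `num_actions` and the class name are placeholders), not the authors' implementation; the `use_mean` flag switches between the aggregation of Equation 8 and that of Equation 9:

```python
import torch
import torch.nn as nn


class DuelingHead(nn.Module):
    """Illustrative dueling head: a shared hidden layer feeding a value stream
    and an advantage stream, combined as in Equation (8) or Equation (9)."""

    def __init__(self, state_dim=8, hidden_dim=64, num_actions=4, use_mean=True):
        super().__init__()
        self.shared = nn.Linear(state_dim, hidden_dim)        # shared layers (theta)
        self.value = nn.Linear(hidden_dim, 1)                 # value stream (beta)
        self.advantage = nn.Linear(hidden_dim, num_actions)   # advantage stream (alpha)
        self.use_mean = use_mean

    def forward(self, state):
        h = torch.relu(self.shared(state))
        v = self.value(h)                       # shape (batch, 1)
        a = self.advantage(h)                   # shape (batch, num_actions)
        if self.use_mean:
            # Equation (9): subtract the mean advantage
            return v + (a - a.mean(dim=1, keepdim=True))
        # Equation (8): subtract the max advantage
        return v + (a - a.max(dim=1, keepdim=True).values)


# Example: a batch of 32 random "states" gives a (32, 4) batch of Q-value estimates.
q_values = DuelingHead(use_mean=False)(torch.randn(32, 8))
```

With `use_mean=False`, the $Q$ value at the greedy action equals the value-stream output `v`, exactly as in case 1 above; with `use_mean=True` it is offset by the constant discussed next.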


Concerning the additional question about Equation 9:

A major problem for the stability of training in Deep Reinforcement Learning algorithms (such as these DQN-based algorithms) is that the update targets contain components that are predictions made by the network that is being trained. For example, the Dueling DQN architecture in this paper generates $V(s)$ and $A(s, a)$ predictions, which are combined into $Q(s, a)$ predictions, and those $Q(s, a)$ predictions of the network itself are also used (combined with observed rewards $r$) in the loss function that trains the network.

In other words, the Neural Network's own predictions are a part of its training signal. When these are used to update the Network, this will likely change its future predictions in similar situations, which means that its update target will also actually change when it reaches a similar situation again; this is a moving target problem. We do not have a consistent set of update targets as we would in a traditional supervised learning setting for example (where we have a dataset collected offline with fixed labels as prediction targets). Our targets are moving around during the training process, and this can destabilize learning.

Now, in that explanation following Equation 9, they essentially argue that this "moving target" problem is less bad with Equation 9 than it is with Equation 8, which can result in more stable training. I'm not sure if there is a formal proof of this, but intuitively it does make sense that this would happen in practice.

Suppose that you update your network once based on Equation 8. If your learning step changes the prediction of the advantage $A(s, a)$ of the best action $a$ by a magnitude of $1$ (kind of informal here, hopefully it makes sense what I'm trying to say), this will in turn move future targets for updates also roughly by a magnitude of $1$ (again, quite informal here).

Now, suppose that you update your network once based on Equation 9. It is unlikely that all of the different actions $a$ have their advantage $A(s, a)$ move by the same magnitude and in the same direction as a result of this update. It is more likely that some will move up, some will move down, etc. And even if they all move in the same direction, some will likely move by a smaller magnitude than others. In some sense, Equation 9 "averages out" the movements triggered by the learning update in all of these different advantage estimates, which causes the network's prediction targets overall to simply move more slowly, reducing the moving target problem. At least, that's the intuitive idea. Again, I don't think there is a formal proof that this happens, but it does turn out to often help in practice.
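A tiny numeric illustration of that intuition (toy numbers, nothing from the paper): if a learning update shifts the best action's advantage estimate by $1$, the subtracted baseline in Equation 8 (the max) moves by the full $1$, while the baseline in Equation 9 (the mean) moves by only $1/|\mathcal{A}|$:

```python
import numpy as np

A_before = np.array([0.5, -1.0, 1.5, 0.2])   # advantage estimates before an update
A_after = A_before.copy()
A_after[2] += 1.0                            # the best action's advantage shifts by 1

# How much does the subtracted baseline move?
print(A_after.max() - A_before.max())    # 1.0    -> Eq. (8): the max baseline moves by the full amount
print(A_after.mean() - A_before.mean())  # ~0.25  -> Eq. (9): the mean baseline moves by only 1/|A| of it
```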

Dennis Soemers
  • Really appreciate your detailed explanation. I have updated my question with further comments. Could you help me out further? Thanks. – Cheng Sep 27 '18 at 01:03
  • @Cheng See edit in answer for explanation of Equation 9. As for vector or matrix notation of Equation 8, yeah, that's fine. When talking about the Equations, I find it easier to consider them in scalar representation rather than vector representation, but they can be interpreted as vectors. – Dennis Soemers Sep 27 '18 at 08:38
  • @Cheng Does my edit help to make things more clear? or is something unclear still? – Dennis Soemers Sep 30 '18 at 11:40
  • Thanks @Dennis Soemers again for the detailed explanation. I replied in the updated question. – Cheng Oct 01 '18 at 01:50
  • @Cheng Short explanation: Yes, your thoughts in your last edit are correct. Eq. 8 and Eq. 9 are both intended to address the identifiability issue, in different ways. The advantage of Eq. 8 is that it more closely matches our intuition of what the estimates should "mean". The advantage of Eq. 9 over Eq. 8 is that it suffers less from the "moving target" problem explained in my answer, and hence tends to result in a more stable learning process in practice. Does that help? – Dennis Soemers Oct 01 '18 at 11:07
  • Thanks @Dennis Soemers, I updated the question again to include another reply from different source, and my further question. How do you think about my last question? – Cheng Oct 01 '18 at 11:42
  • @Cheng Yes I think your understanding is correct. Maybe the "+ mean(A)" should be "- mean(A)" instead in your last edit, not 100% sure right now. It's a bit confusing with the informal shorthand notations (without full arguments). Subtracting mean advantages instead of max advantages from all $Q$ estimates makes all $Q$ estimates slightly larger (by the same constant) than they "should" intuitively be... so yes, I think you're right. – Dennis Soemers Oct 01 '18 at 12:13
  • Thanks @Dennis Soemers, I have a different opinion from yours. The Q estimates should remain the same. My understanding is that for Equation 9, the estimates for V(s) will become slightly larger (by a constant). And the constant is mean(A(s)). – Cheng Oct 01 '18 at 23:11
  • @Cheng Yes you're technically right, $Q$ estimates are still the same since they're the final outputs that we train using the standard loss function etc. I was just comparing Eq 8 to Eq 9; in Eq 9, we subtract a smaller $A$ value than in Eq 8 (mean rather than max), so since there's a $Q$ before the $=$ sign, I automatically read that as "well, the quantity on the left-hand side of the equals sign must then also be a bit larger". But technically you're right, the difference would be in the $V$ rather than the $Q$ due to how the learning algorithm works. – Dennis Soemers Oct 02 '18 at 08:30
  • @Cheng SE format is not a discussion board. You should not keep updating the question, and thus mutating the topic. If you have another question, just post a follow-up as a separate Q&A. – BartoszKP Oct 02 '18 at 15:13
  • @BartoszKP, I disagree with you. I believe SE is a knowledge service in the form of Q&A. It should be a form that is easily accessible for a user and also for future questioners. I don't think the same question of identifiability should be divided. That is my understanding. – Cheng Oct 02 '18 at 22:59
  • @Cheng Yes, the purpose of the site is to create a knowledge base for everyone. It should however be a knowledge base consisting of specific questions and specific answers, not chat history between you and the one who answers. Please see: https://meta.stackexchange.com/questions/43478/exit-strategies-for-chameleon-questions and https://ai.stackexchange.com/help/how-to-ask . Your question with all the alterations and responses is really hard to follow. The question post should contain *only* the question. If you want a clarification use comments, if you want to chat - use chat. – BartoszKP Oct 03 '18 at 08:13

I believe that is explained on the prior page:

"Intuitively, the value function $V$ measures the how good it is to be in a particular state $s$. The $Q$ function, however, measures the the value of choosing a particular action when in this state. The advantage function subtracts the value of the state from the $Q$ function to obtain a relative measure of the importance of each action."

Then, two paragraphs above where you started your quote:

"However, we need to keep in mind that $Q(s, a; \theta, \alpha, \beta)$ is only a parameterized estimate of the true $Q$-function. Moreover, it would be wrong to conclude that $V (s; \theta, \beta)$ is a good estimator of the state-value function, or likewise that $A(s, a; \theta, \alpha)$ provides a reasonable estimate of the advantage function.

Equation (7) is unidentifiable in the sense that given $Q$ we cannot recover $V$ and $A$ uniquely. To see this, add a constant to $V (s; \theta, \beta)$ and subtract the same constant from $A(s, a; \theta, \alpha)$. This constant cancels out resulting in the same $Q$ value. This lack of identifiability is mirrored by poor practical performance when this equation is used directly."
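A quick NumPy sketch of that "add a constant to $V$, subtract it from $A$" argument, with toy numbers of my own:

```python
import numpy as np

V = 2.0
A = np.array([0.5, -1.0, 1.5, 0.2])
c = 10.0                  # any constant

Q_original = V + A        # Equation (7): Q as a plain sum of the two streams
Q_shifted = (V + c) + (A - c)

print(np.allclose(Q_original, Q_shifted))  # True: the constant cancels, so Q alone
                                           # cannot tell us what V and A individually were
```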

Another way of looking at it would be:

  • You receive answers to your question

  • Answers receive votes

  • Answerers have reputation

In a perfect world people could vote based on reputation, with a weighting based upon the correctness of the answer.

You could simply look at which answer received the most votes and choose it as correct.

In the real world things don't work that way: things are correct or incorrect whether they are measured or not (think quantum mechanics), and measurement doesn't always reveal the true answer.

See: Parameter Estimation.

The estimate of the advantage is only so good; sometimes it's useful to consider it and in other instances it's useful to reject it - intelligently doing both maximizes its usefulness.

Rob
  • Thanks @Rob. I commented in the updated question. If you understand the paper, please do help me out. Thanks. – Cheng Sep 25 '18 at 06:23