
I was reading this paper https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf, and it presents the following algorithm for deep Q-learning with experience replay:

[Algorithm 1: Deep Q-learning with Experience Replay]

On line 12, where the algorithm sets the value of $y_j$, the second case reads:

$$r_j + \gamma \max_{a'} Q(\phi_{j+1}, a'; \theta)$$

I'm confused as to what $a'$ refers to and where it comes from.

(Edit) Why is it $a$ on this line (line 7):

$$a_t = \max_a Q^*(\phi(s_t), a; \theta)$$

but $a'$ on line 12?

Can someone please explain it to me?

Ness

1 Answer


$r_j + \gamma \max_{a'}Q(\phi_{j+1},a';\theta)$
I'm confused as to what $a'$ refers to and where it comes from.

Here $a'$ is a "dummy" argument over which you perform the maximization operation $\max_{a'}$.

In practice, that would correspond to the axis (or dim) argument in numpy/pytorch/tensorflow: you compute the Q-values for all actions and take the maximum along the action dimension.
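
For example, here is a minimal PyTorch sketch (the network q_net, its sizes, and the batch below are made up for illustration and are not part of the paper):

```python
import torch

# Hypothetical Q-network: maps a state (4 features here) to one Q-value per action (2 actions here).
q_net = torch.nn.Linear(4, 2)

phi_next = torch.randn(32, 4)      # a batch of next states phi_{j+1}
r = torch.randn(32)                # rewards r_j
gamma = 0.99

q_values = q_net(phi_next)         # shape (32, 2): Q(phi_{j+1}, a'; theta) for every action a'

# max_{a'} Q(phi_{j+1}, a'; theta): a' is just the dimension we maximize over (dim=1).
max_q, _ = q_values.max(dim=1)     # shape (32,)

y = r + gamma * max_q              # targets y_j for non-terminal transitions
```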

$a_t = \max_a Q^*(\phi(s_t),a;\theta)$
Why is it $a$ on line 7?

I'd say that in this case it is sloppy mathematical notation (or just a typo) on the authors' part. Line 7 selects an action, so it should be argmax, not max: $$a_t = \arg \max_a Q^*(\phi(s_t),a;\theta)$$
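
To make the max/argmax distinction concrete, here is a small sketch under the same made-up setup as above:

```python
import torch

q_net = torch.nn.Linear(4, 2)      # hypothetical Q-network, as above

phi_t = torch.randn(1, 4)          # preprocessed state phi(s_t)
q_values = q_net(phi_t)            # shape (1, 2)

# Line 7 needs an *action*, i.e. the index of the best Q-value, so argmax is the right operation.
a_t = q_values.argmax(dim=1).item()

# max would instead give the best Q-*value*, which is what line 12 uses inside the target y_j.
best_q = q_values.max(dim=1).values.item()
```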

Kostya
  • So, it doesn't refer to an action a'? – Ness Jan 09 '23 at 16:48
  • Could you check my edit please? – Ness Jan 09 '23 at 17:07
  • 2
    @Ness: It is both a dummy argument, and refers to a *potential* action taken as $a_{t+1}$, but not necessarily the *actual* observed action taken during training. For Q-learning, you don't look at the next real action choice, but optimise based on the current maximising action choice, whilst other methods may want to use expected action on the behaviour policy or real observed action that the agent takes next – Neil Slater Jan 09 '23 at 17:13
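
To illustrate the comment above: the Q-learning target in line 12 bootstraps from the maximizing (potential) next action, not from the action the agent actually takes next. For contrast, a SARSA-style target (one of the "other methods" mentioned, not what DQN does) would use the observed next action; the sketch below reuses the made-up q_net from the earlier examples:

```python
import torch

q_net = torch.nn.Linear(4, 2)           # hypothetical Q-network, as in the sketches above

phi_next = torch.randn(1, 4)            # next state phi_{j+1}
r, gamma = 1.0, 0.99
a_next_observed = 0                     # action the agent actually took next (ignored by Q-learning)

q_next = q_net(phi_next)                # shape (1, 2)

# Q-learning / DQN target: bootstrap from the current maximizing action.
y_q_learning = r + gamma * q_next.max(dim=1).values.item()

# SARSA-style target (not what DQN does): bootstrap from the action actually taken next.
y_sarsa = r + gamma * q_next[0, a_next_observed].item()
```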