
I am currently studying reinforcement learning, especially DQN. In DQN, learning proceeds by minimizing a norm (least squares, Huber, etc.) of the difference between the optimal Bellman backup and the approximate Q-function, roughly: $$ \min\|B^*Q^*-\hat{Q}\|. $$ Here $\hat{Q}$ is an estimator of the Q-function, $Q^*$ is the optimal Q-function, and $B^*$ is the optimal Bellman operator, $$ B^*Q^*(s,a)=\sum_{s'}p_T(s'|s,a)[r(s,a,s')+\gamma \max_{a'}Q^*(s',a')], $$ where $p_T$ is the transition probability, $r$ is the immediate reward, and $\gamma$ is the discount factor. As I understand it, in the DQN algorithm the expectation in the optimal Bellman equation is approximated by a single sampled transition, and the optimal Q-function $Q^*$ is further approximated by an estimator different from $\hat{Q}$, say $\tilde{Q}$: \begin{equation}\label{question} B^*Q^*(s,a)\approx r(s,a,s')+\gamma\max_{a'}Q^*(s',a')\approx r(s,a,s')+\gamma\max_{a'}\tilde{Q}(s',a'),\tag{*} \end{equation} so the problem becomes $$ \min\|r(s,a,s')+\gamma\max_{a'}\tilde{Q}(s',a')-\hat{Q}(s,a)\|. $$
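For concreteness, the final objective can be evaluated on one sampled transition. Here is a minimal sketch; the transition, $\gamma$, and both Q-values below are made-up numbers for illustration, not anything from the DQN paper:

```python
import numpy as np

# Illustrative numbers only: a single sampled transition (s, a, r, s')
# and stand-in Q-values (Q_hat = online estimate, Q_tilde = target estimate).
gamma = 0.99
r = 1.0                              # sampled immediate reward r(s, a, s')
Q_hat_sa = 2.0                       # \hat{Q}(s, a)
Q_tilde_next = np.array([1.5, 3.0])  # \tilde{Q}(s', a') over actions a'

td_target = r + gamma * Q_tilde_next.max()  # the approximate Bellman backup
loss = (td_target - Q_hat_sa) ** 2          # squared-norm form of the objective
print(round(loss, 4))  # → 3.8809
```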

What I want to ask: I would like to know the mathematical or theoretical background of the approximation \eqref{question}, especially why the first approximation is justified; it looks very rough. Can the right-hand side be regarded as an "approximate Bellman equation"? I have looked through various literature and online resources, but none of them gives an exact derivation, so I would also be grateful for references.

  • Hi. Can you please edit your post to include a link to the research paper or book where you took these equations from, in order to provide some context? – nbro Apr 09 '21 at 13:12

1 Answer


Both your notation and terminology are quite confusing. For example, I'm not sure what an "optimal" Bellman operator is. Here's a good clarification of the definition of a Bellman operator. Likewise, your description of the DQN algorithm completely ignores the averaging over states/actions/rewards sampled from the replay memory.

Trying to salvage your notation, I'll introduce a Bellman operator $B$ that acts on any action-value function $Q$ as:

$$B[Q(s,a)] = \mathbb{E}_{s'\sim p_T(\cdot|s,a)}\left[r(s,a,s') + \gamma \max_{a'} Q(s',a') \right]$$

The optimal Q-function $Q^*$ satisfies the Bellman equation:

$$B[Q^*(s,a)] = Q^*(s,a)$$

The value-iteration algorithm iteratively applies the Bellman operator: $$Q^{n+1}(s,a) = B[Q^n(s,a)]$$ and, since $B$ is a $\gamma$-contraction in the sup-norm, is proven to converge to the optimal Q-function: $$Q^*(s,a) = \lim_{n\to\infty}Q^{n}(s,a)$$
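The iteration above can be sketched on a toy tabular MDP. All transition probabilities and rewards below are made up for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: transition tensor P[s, a, s'] and
# reward tensor R[s, a, s'] -- the numbers are invented for this sketch.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
gamma = 0.9

def bellman(Q):
    """Apply B: expectation over s' of r + gamma * max_{a'} Q(s', a')."""
    return np.einsum('ijk,ijk->ij', P, R + gamma * Q.max(axis=1)[None, None, :])

Q = np.zeros((2, 2))
for _ in range(500):   # Q^{n+1} = B[Q^n]
    Q = bellman(Q)

# At the fixed point, B[Q*] = Q* holds up to numerical tolerance.
print(np.allclose(bellman(Q), Q, atol=1e-6))
```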

Now, in the DQN algorithm we approximate Q-functions with DNNs with parameters $\theta$: $Q(s,a; \theta)$. Then we approximate one value-iteration step by minimizing the norm

$$\mathbb{E}_{s,a} \left\|B[Q(s,a; \theta_{i-1})] - Q(s,a; \theta_i)\right\|$$

The minimization is performed over parameters $\theta_i$ with previous parameters $\theta_{i-1}$ held fixed.

The averages $\mathbb{E}_{s,a}$ and $\mathbb{E}_{s'}$ are approximated by sampling a minibatch $MB$ of $(s,a,r,s')$ tuples from the replay memory.
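This scheme can be sketched with a linear model standing in for the DNN; the feature map, replay tuples, and all hyperparameters below are illustrative assumptions, not the DQN paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_actions = 4, 2
gamma, lr = 0.9, 0.05

def phi(s, a):
    """Toy feature map for a scalar state and a discrete action (illustrative)."""
    f = np.zeros(n_feat)
    f[2 * a] = s
    f[2 * a + 1] = 1.0
    return f

def q(s, a, theta):
    return phi(s, a) @ theta  # linear stand-in for Q(s, a; theta)

# Minibatch MB of (s, a, r, s') tuples, standing in for replay-memory samples.
batch = [(rng.uniform(), int(rng.integers(n_actions)), rng.uniform(), rng.uniform())
         for _ in range(32)]

theta_prev = rng.normal(size=n_feat)  # theta_{i-1}: held fixed during the step
theta = theta_prev.copy()             # theta_i: the parameters being optimized

def bellman_error(th):
    """Minibatch squared norm ||B[Q(.; theta_prev)] - Q(.; th)||^2."""
    targets = [r + gamma * max(q(s2, a2, theta_prev) for a2 in range(n_actions))
               for (_, _, r, s2) in batch]
    return np.mean([(q(s, a, th) - t) ** 2
                    for (s, a, _, _), t in zip(batch, targets)])

for _ in range(200):  # plain gradient descent on the fixed-target loss
    grad = np.zeros(n_feat)
    for s, a, r, s2 in batch:
        target = r + gamma * max(q(s2, a2, theta_prev) for a2 in range(n_actions))
        grad += (q(s, a, theta) - target) * phi(s, a)  # target does not move
    theta -= lr * grad / len(batch)
```

Because the targets are computed from the frozen $\theta_{i-1}$, the minimization is an ordinary (convex, in this linear sketch) least-squares problem, and the Bellman error decreases over the descent steps.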

I suppose that is the closest I can get to the crux of your question. You are claiming that the average in the Bellman operator application is approximated by a single point: $$B[Q(s,a)] \simeq r(s,a,s') + \gamma \max_{a'} Q(s',a')$$ while, in fact, it is approximated by a minibatch average: $$B[Q(s,a)] \simeq \mathbb{E}_{s'\in MB} \left[ r(s,a,s') + \gamma \max_{a'} Q(s',a')\right]$$ Since each sampled term is an unbiased estimate of the expectation, this average converges to the exact Bellman backup as more samples are drawn.
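A quick numerical check of this; the transition probabilities, rewards, and Q-values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: from a fixed (s, a), the environment moves to
# s' in {0, 1} with probabilities p_T, yielding rewards r[s'].
p_T = np.array([0.3, 0.7])
r = np.array([1.0, 0.0])
gamma = 0.9
Q_next_max = np.array([2.0, 5.0])  # max_{a'} Q(s', a') for each s'

exact = np.sum(p_T * (r + gamma * Q_next_max))  # the true Bellman backup

# Each sampled transition gives a noisy but unbiased one-point estimate;
# averaging over many sampled s' recovers the expectation.
samples = rng.choice(2, size=100_000, p=p_T)
estimate = np.mean(r[samples] + gamma * Q_next_max[samples])
print(abs(estimate - exact) < 0.05)
```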

Kostya