
In the original DQN paper (page 1), the loss function of DQN is

$$ L_{i}(\theta_{i}) = \mathbb{E}_{(s,a,r,s') \sim U(D)} \left[ \left(r+\gamma \max_{a'} Q(s',a';\theta_{i}^{-}) - Q(s,a;\theta_{i})\right)^2 \right] $$

whose gradient is presented (on page 7) as

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a,r,s'} [(r+\gamma \max_{a'}Q(s',a';\theta_i^-) - Q(s,a;\theta_i))\nabla_{\theta_i}Q(s,a;\theta_i)] $$

But why is there no minus sign, given that it is $-Q(s,a;\theta_i)$ that is parameterized by $\theta_i$? And where did the 2 from the exponent go?


1 Answer


In general, if you have a composite function $h(x) = g(f(x))$, then $\frac{dh}{dx} = \frac{d g}{df} \frac{d f}{dx}$. In your case, the function to differentiate is

$$L_{i}(\theta_{i}) = \mathbb{E}_{(s,a,r,s') \sim U(D)} \left[ \left(r+\gamma \max_{a'} Q(s',a';\theta_{i}^{-}) - Q(s,a;\theta_{i}) \right)^2 \right]$$
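
To see the shape of the computation in miniature, consider a scalar analogue (my own toy example, not from the paper): differentiate $h(x) = (c - f(x))^2$, where $c$ is a constant. The chain rule gives

$$h'(x) = 2\left(c - f(x)\right) \cdot \frac{d}{dx}\left(c - f(x)\right) = -2\left(c - f(x)\right)f'(x),$$

which already shows where both the $2$ and the minus sign come from. The rest of this answer carries out the same computation underneath the expectation.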

So, we want to calculate $\nabla_{\theta_i} L_{i}(\theta_{i})$, which is equal to

$$\nabla_{\theta_i} \mathbb{E}_{(s,a,r,s') \sim U(D)} \left[ \left(r+\gamma \max_{a'} Q(s',a';\theta_{i}^{-}) - Q(s,a;\theta_{i}) \right)^2 \right] \label{2} \tag{2} $$

For clarity, let's ignore the iteration number $i$ in \ref{2}, which can thus more simply be written as

$$\nabla_{\theta} \mathbb{E}_{(s,a,r,s') \sim U(D)} \left[ \left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta) \right)^2 \right] \label{3} \tag{3} $$

The subscript of the expected value operator, $(s,a,r,s') \sim U(D)$, means that the expectation is taken with respect to the multivariate random variable $\mathbf{E}$ (for experience), whose values (or realizations) are the tuples $\mathbf{e}=(s,a,r,s')$ and which follows the distribution $U(D)$, a uniform distribution over the experience replay buffer $D$; in other words, the tuples $\mathbf{e}=(s,a,r,s')$ are drawn uniformly at random from $D$ (written $\mathbf{E} \sim U(D)$). Let's omit this subscript for now: $\gamma$ and $a'$ are not random variables, so there are no other random variables in \ref{3}, and there is no ambiguity about which random variable the expectation is taken with respect to. Equation \ref{3} can then be written as
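
To make $U(D)$ concrete, here is a minimal sketch of uniform minibatch sampling from a replay buffer (an illustrative toy, not the paper's actual code; the capacity and batch size below are made-up values):

```python
import random
from collections import deque

# Replay buffer D: a bounded queue of experience tuples e = (s, a, r, s').
D = deque(maxlen=10_000)  # capacity is an assumed value, for illustration

def store(s, a, r, s_next):
    """Append one experience tuple e = (s, a, r, s') to D."""
    D.append((s, a, r, s_next))

def sample_uniform(batch_size=32):
    """Draw a minibatch e ~ U(D): every stored tuple is equally likely."""
    return random.sample(D, batch_size)
```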

$$\nabla_{\theta} \mathbb{E} \left[ \left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta) \right)^2 \right] \label{4} \tag{4} $$

Now, recall that, in the case of a discrete random variable, the expected value is a weighted sum, while, in the case of a continuous random variable, it is an integral. So, if $\mathbf{E}$ is a continuous random variable, the expectation in \ref{4} can be expanded to

$$ \int_{\mathbb{D}} \left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right)^2 f(\mathbf{e})\, d\mathbf{e} $$

where $f$ is the density function associated with $\mathbf{E}$ and $\mathbb{D}$ is the domain of the random variable $\mathbf{E}$.
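
In practice, this integral is never evaluated exactly; the expectation is estimated by a Monte Carlo average over a minibatch of $N$ tuples drawn uniformly from the buffer (the standard estimator, with $N$ here denoting an assumed minibatch size):

$$\mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right)^2\right] \approx \frac{1}{N}\sum_{j=1}^{N}\left(r_j + \gamma \max_{a'} Q(s'_j,a';\theta^{-}) - Q(s_j,a_j;\theta)\right)^2$$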

The derivative of an integral can be computed with the Leibniz integral rule. In the case where the bounds of integration are constants ($a$ and $b$), the Leibniz integral rule reduces to

$${\displaystyle {\frac {d}{dx}}\left(\int _{a}^{b}f(x,t)\,dt\right)=\int _{a}^{b}{\frac {\partial }{\partial x}}f(x,t)\,dt.}$$

Observe that the derivative is taken with respect to the variable $x$, while the integration is taken with respect to variable $t$.
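
As a tiny sanity check of the rule (my own example, not from the paper): take $f(x,t) = x\,t^2$ on $[0,1]$. Integrating first and then differentiating gives the same result as differentiating under the integral sign:

$$\frac{d}{dx}\int_0^1 x\,t^2\,dt = \frac{d}{dx}\,\frac{x}{3} = \frac{1}{3} = \int_0^1 t^2\,dt = \int_0^1 \frac{\partial}{\partial x}\left(x\,t^2\right)dt.$$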

In our case, the domain of integration, $\mathbb{D}$, is constant because it simply represents all possible experiences in the replay buffer, $D$. Therefore, the gradient in \ref{4} can be written as

\begin{align}
\nabla_{\theta} \int_{\mathbb{D}} \left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right)^2 f(\mathbf{e})\, d\mathbf{e}
&= \int_{\mathbb{D}} \nabla_{\theta} \left( \left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right)^2 f(\mathbf{e}) \right) d\mathbf{e} \\
&= \mathbb{E} \left[ \nabla_{\theta} \left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta) \right)^2 \right] \label{5}\tag{5}
\end{align}

Recall that the derivative of $g(x)=x^2$ is $g'(x)=2x$ and that the derivative of a constant is zero. Note now that the only term in \ref{5} that contains $\theta$ is $Q(s,a;\theta)$, so all other terms are constant with respect to $\theta$. Hence, \ref{5} can be written as

\begin{align}
\mathbb{E} \left[ \nabla_{\theta} \left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right)^2 \right]
&= \mathbb{E} \left[ 2 \left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right) \nabla_{\theta} \left( r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta) \right) \right] \\
&= \mathbb{E} \left[ 2 \left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right) \left( \nabla_{\theta}\, r + \nabla_{\theta}\, \gamma \max_{a'} Q(s',a';\theta^{-}) - \nabla_{\theta} Q(s,a;\theta) \right) \right] \\
&= \mathbb{E} \left[ -2 \left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right) \nabla_{\theta} Q(s,a;\theta) \right] \\
&= -2\, \mathbb{E} \left[ \left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right) \nabla_{\theta} Q(s,a;\theta) \right]
\end{align}

The $-2$ disappears because it is a constant factor that can be absorbed into the learning rate: descending along $\nabla_\theta L$ with step size $\alpha$ moves the parameters in exactly the direction of the paper's expression, with an effective step size of $2\alpha$.
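
The whole derivation can also be checked numerically with automatic differentiation. Below is a minimal sketch (assuming PyTorch, with a made-up linear Q-function and arbitrary numbers) that compares autograd's gradient of the squared TD error against the hand-derived $-2\left(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta)\right)\nabla_\theta Q(s,a;\theta)$:

```python
import torch

# Numerical check of the derivation (a sketch: the linear "Q-network"
# and all numbers below are made up for illustration).
torch.manual_seed(0)

n_states, n_actions = 4, 2
theta = torch.randn(n_states, n_actions, requires_grad=True)  # online parameters
theta_minus = torch.randn(n_states, n_actions)                # frozen target parameters

s = torch.randn(n_states)       # current state features
s_next = torch.randn(n_states)  # next state features
a, r, gamma = 0, 1.0, 0.99      # action taken, reward, discount

q_sa = s @ theta                                   # Q(s, .; theta) for all actions
target = r + gamma * (s_next @ theta_minus).max()  # constant with respect to theta

loss = (target - q_sa[a]) ** 2
loss.backward()  # autograd applies the chain rule for us

# Hand-derived gradient: -2 (target - Q(s,a)) * grad_theta Q(s,a).
# For this linear Q, grad_theta Q(s,a) is s in column a and 0 elsewhere.
grad_manual = torch.zeros_like(theta)
grad_manual[:, a] = -2.0 * (target - q_sa[a]).detach() * s

print(torch.allclose(theta.grad, grad_manual))  # True
```

In a real implementation, the constant $-2$ is never handled explicitly: the target is simply held fixed (e.g. computed under `torch.no_grad()`), and the optimizer's learning rate absorbs any constant factor.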
