
[Image: screenshot from the paper, with the formula in question circled in red]

I don't understand how the formula in the red circle is derived. The screenshot is taken from this paper.

nbro
hijkzzz
  • I recently started wondering about pretty much the same question. How can, for example, Q-learning work with automatic differentiation when a constant is subtracted from the output of the actual Q-network (the predicted Q-value) to compute an update of that predicted Q-value, given that constants are commonly dropped during differentiation? That is: why is the baseline estimate/constant, i.e. the part that does not depend on the policy network, not dropped during differentiation? – Daniel B. Jan 17 '21 at 11:09

1 Answer


I will refer to $\mathcal T^{\pi}$ as $\mathcal T$ and $P^{\pi}$ as $P$ for notational simplicity. Applying the Bellman operator $n+1$ times gives \begin{align} (\mathcal{T})^{n+1} Q &= \mathcal{T}(\mathcal{T}(...(\mathcal{T}(Q))))\\ &= r + \gamma P(r + \gamma P(...(r + \gamma P Q)))\\ &= r + \sum_{i=1}^{n} \gamma^i P^i r + \gamma^{n+1} P^{n+1} Q \end{align}
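
As a quick sanity check of the pattern (a worked instance only, using the same symbols as above), the case $n = 1$ can be expanded by hand: \begin{align} (\mathcal{T})^{2} Q &= \mathcal{T}(r + \gamma P Q)\\ &= r + \gamma P (r + \gamma P Q)\\ &= r + \gamma P r + \gamma^{2} P^{2} Q \end{align} Each extra application of $\mathcal T$ adds one more discounted reward term and pushes $Q$ one power of $\gamma P$ further out.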

\begin{align} \mathcal{T}_{\lambda}Q &= (1-\lambda) \sum_{n=0}^{\infty} \lambda^n (\mathcal{T})^{n+1} Q\\ &=(1-\lambda)\{\lambda^0 (\mathcal T)^1Q + \lambda^1 (\mathcal T)^2Q + \lambda^2 (\mathcal T)^3Q + \ldots \} \end{align}

When you plug the expression for $(\mathcal T)^{n+1} Q$ into this sum and rearrange, you get three sums. For the middle one, exchange the order of summation: $(1-\lambda)\sum_{n=0}^{\infty} \lambda^n \sum_{i=1}^{n} \gamma^i P^i r = (1-\lambda)\sum_{i=1}^{\infty} \gamma^i P^i r \sum_{n=i}^{\infty} \lambda^n = \sum_{i=1}^{\infty} \lambda^i \gamma^i P^i r$, so the factor $(1-\lambda)$ cancels there. \begin{equation} \mathcal{T}_{\lambda}Q = (1-\lambda) \sum_{n=0}^{\infty} \lambda^n r + \sum_{n=1}^{\infty} \lambda^n \gamma^n P^n r + (1-\lambda)\sum_{n=0}^{\infty} \lambda^n \gamma^{n+1} P^{n+1} Q \end{equation}

  1. First sum (a geometric series in $\lambda$): \begin{equation} (1-\lambda) \sum_{n=0}^{\infty} \lambda^n r = r \end{equation}
  2. Second sum (here $1$ denotes the identity operator): \begin{equation} \sum_{n=1}^{k} \lambda^n \gamma^n P^n r = (1-\lambda\gamma P)^{-1}(1 - \lambda^k \gamma^k P^k)\lambda\gamma P r \end{equation} As $k \rightarrow \infty$, $\lambda^k \gamma^k P^k \rightarrow 0$ because $\lambda\gamma < 1$ and $P$ is a stochastic matrix, so in the limit \begin{equation} \sum_{n=1}^{\infty} \lambda^n \gamma^n P^n r = (1-\lambda\gamma P)^{-1}\lambda\gamma P r \end{equation}
  3. Third sum: \begin{equation} (1-\lambda)\sum_{n=0}^{\infty} \lambda^n \gamma^{n+1} P^{n+1} Q = (1 - \gamma\lambda P)^{-1}(1-\lambda)\gamma P Q \end{equation}

If you combine all three, you get \begin{align} \mathcal{T}_{\lambda}Q &= r + (1-\lambda\gamma P)^{-1}\lambda\gamma P r + (1 - \gamma\lambda P)^{-1}(1-\lambda)\gamma P Q\\ &= r+ (1-\lambda\gamma P)^{-1}(\lambda \gamma P r + \gamma PQ - \lambda\gamma PQ)\\ &= r+ (1-\lambda\gamma P)^{-1}(\lambda \gamma P r + (\mathcal T)Q - r - \lambda\gamma PQ)\\ &= (1-\lambda\gamma P)^{-1}(r - \lambda \gamma P r + \lambda \gamma P r + (\mathcal T)Q - r - \lambda\gamma PQ)\\ &= (1-\lambda\gamma P)^{-1}( (\mathcal T)Q - \lambda\gamma PQ + Q - Q)\\ &= Q + (1-\lambda\gamma P)^{-1}( (\mathcal T)Q - Q) \end{align} The third line uses $\gamma P Q = (\mathcal T)Q - r$, and the last line uses $(1-\lambda\gamma P)^{-1}(Q - \lambda\gamma PQ) = Q$.
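
The final identity can also be checked numerically. The snippet below is only a sketch under assumptions that are not in the paper or the answer: a small random row-stochastic matrix stands in for $P$, random vectors stand in for $r$ and $Q$, and the values of `gamma`, `lam`, and `n_terms` are arbitrary. It compares the truncated series $(1-\lambda)\sum_n \lambda^n (\mathcal T)^{n+1} Q$ with the closed form $Q + (1-\lambda\gamma P)^{-1}((\mathcal T)Q - Q)$.

```python
import numpy as np

# Hypothetical toy problem: sizes and values are illustrative assumptions.
rng = np.random.default_rng(0)
n_states = 5
gamma, lam = 0.9, 0.7

# Random row-stochastic matrix standing in for P^pi.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)

r = rng.standard_normal(n_states)  # reward vector
Q = rng.standard_normal(n_states)  # arbitrary value estimate


def bellman(q):
    """One application of the Bellman operator: T q = r + gamma * P q."""
    return r + gamma * P @ q


def t_lambda_series(q, n_terms=500):
    """Truncated series (1 - lam) * sum_n lam^n * T^(n+1) q."""
    total = np.zeros_like(q)
    tq = q
    for n in range(n_terms):
        tq = bellman(tq)  # tq now holds T^(n+1) q
        total += lam**n * tq
    return (1 - lam) * total


def t_lambda_closed(q):
    """Closed form q + (I - lam * gamma * P)^(-1) (T q - q)."""
    identity = np.eye(n_states)
    return q + np.linalg.solve(identity - lam * gamma * P, bellman(q) - q)


series = t_lambda_series(Q)
closed = t_lambda_closed(Q)
print("max abs difference:", np.max(np.abs(series - closed)))
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a design choice for numerical stability; the two forms should agree up to floating-point error once the series is truncated far enough.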
Brale