
I'm wondering how the general return-based off-policy operator in "Safe and efficient off-policy reinforcement learning" (Munos et al., 2016) is derived: $$\mathcal{R} Q(x, a):=Q(x, a)+\mathbb{E}_{\mu}\left[\sum_{t \geq 0} \gamma^{t}\left(\prod_{s=1}^{t} c_{s}\right)\left(r_{t}+\gamma \mathbb{E}_{\pi} Q\left(x_{t+1}, \cdot\right)-Q\left(x_{t}, a_{t}\right)\right)\right]$$
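
To make sure I'm reading the expectation correctly, here is a minimal numerical sketch of a one-sample estimate of $\mathcal{R}Q(x_0, a_0)$ from a single trajectory. The names (`Q`, `pi`, `mu`, `retrace_target`, the fake trajectory) are arbitrary placeholders of my own, not from the paper; the trace shown is, if I understand correctly, the Retrace($\lambda$) choice $c_s = \lambda \min(1, \pi(a_s|x_s)/\mu(a_s|x_s))$ from the paper.

```python
import numpy as np

# Placeholder tabular setup: random Q-table, random target policy pi and
# behaviour policy mu. Everything here is illustrative only.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, lam = 5, 3, 0.99, 0.9

Q = rng.normal(size=(n_states, n_actions))                # current value estimates Q(x, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)     # target policy pi(a | x)
mu = rng.dirichlet(np.ones(n_actions), size=n_states)     # behaviour policy mu(a | x)

def retrace_target(traj, c_fn):
    """One-sample estimate of R Q(x_0, a_0) for a trajectory
    [(x_0, a_0, r_0, x_1), (x_1, a_1, r_1, x_2), ...] sampled under mu.
    c_fn(x, a) returns the trace coefficient c_s for that step."""
    x0, a0 = traj[0][0], traj[0][1]
    target = Q[x0, a0]
    trace = 1.0                                            # prod_{s=1}^{t} c_s; empty product = 1 at t = 0
    for t, (x, a, r, x_next) in enumerate(traj):
        if t > 0:
            trace *= c_fn(x, a)                            # multiply in c_t for t >= 1
        # r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t)
        td_error = r + gamma * (pi[x_next] @ Q[x_next]) - Q[x, a]
        target += (gamma ** t) * trace * td_error
    return target

# Retrace(lambda) trace: truncated importance weight c_s = lam * min(1, pi/mu).
retrace_c = lambda x, a: lam * min(1.0, pi[x, a] / mu[x, a])

# Fake 3-step trajectory (x_t, a_t, r_t, x_{t+1}) just to exercise the function;
# in practice it would be sampled by following mu.
traj = [(0, 1, 1.0, 2), (2, 0, 0.0, 3), (3, 2, 0.5, 4)]
print(retrace_target(traj, retrace_c))
```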

If this operator is applied to TD($\lambda$), does the equation correspond to the forward view of TD($\lambda$)?
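
For example, if I take the on-policy case $\mu = \pi$ and set $c_s = \lambda$, the expression seems to reduce to $$\mathcal{R} Q(x, a) = Q(x, a)+\mathbb{E}_{\pi}\left[\sum_{t \geq 0}(\gamma \lambda)^{t}\left(r_{t}+\gamma \mathbb{E}_{\pi} Q\left(x_{t+1}, \cdot\right)-Q\left(x_{t}, a_{t}\right)\right)\right],$$ which looks like the $\lambda$-return written as a discounted sum of (expected) TD errors, i.e. the forward view. Is that reading correct?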

What is the difference between the trace $c_s$ and an eligibility trace?
