
I'm wondering how the general return-based off-policy operator in "Safe and efficient off-policy reinforcement learning" (Munos et al., 2016) is derived: $$\mathcal{R} Q(x, a):=Q(x, a)+\mathbb{E}_{\mu}\left[\sum_{t \geq 0} \gamma^{t}\left(\prod_{s=1}^{t} c_{s}\right)\left(r_{t}+\gamma \mathbb{E}_{\pi} Q\left(x_{t+1}, \cdot\right)-Q\left(x_{t}, a_{t}\right)\right)\right]$$
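
To make sure I'm reading the expectation correctly, here is a minimal numerical sketch of a one-sample estimate of $\mathcal{R}Q(x_0, a_0)$ from a single trajectory. The names (`Q`, `pi`, `mu`, `retrace_target`, the fake trajectory) are arbitrary placeholders of my own, not from the paper; the trace shown is, if I understand correctly, the Retrace($\lambda$) choice $c_s = \lambda \min(1, \pi(a_s|x_s)/\mu(a_s|x_s))$ from the paper.

```python
import numpy as np

# Placeholder tabular setup: random Q-table, random target policy pi and
# behaviour policy mu. Everything here is illustrative only.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, lam = 5, 3, 0.99, 0.9

Q = rng.normal(size=(n_states, n_actions))                # current value estimates Q(x, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)     # target policy pi(a | x)
mu = rng.dirichlet(np.ones(n_actions), size=n_states)     # behaviour policy mu(a | x)

def retrace_target(traj, c_fn):
    """One-sample estimate of R Q(x_0, a_0) for a trajectory
    [(x_0, a_0, r_0, x_1), (x_1, a_1, r_1, x_2), ...] sampled under mu.
    c_fn(x, a) returns the trace coefficient c_s for that step."""
    x0, a0 = traj[0][0], traj[0][1]
    target = Q[x0, a0]
    trace = 1.0                                            # prod_{s=1}^{t} c_s; empty product = 1 at t = 0
    for t, (x, a, r, x_next) in enumerate(traj):
        if t > 0:
            trace *= c_fn(x, a)                            # multiply in c_t for t >= 1
        # r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t)
        td_error = r + gamma * (pi[x_next] @ Q[x_next]) - Q[x, a]
        target += (gamma ** t) * trace * td_error
    return target

# Retrace(lambda) trace: truncated importance weight c_s = lam * min(1, pi/mu).
retrace_c = lambda x, a: lam * min(1.0, pi[x, a] / mu[x, a])

# Fake 3-step trajectory (x_t, a_t, r_t, x_{t+1}) just to exercise the function;
# in practice it would be sampled by following mu.
traj = [(0, 1, 1.0, 2), (2, 0, 0.0, 3), (3, 2, 0.5, 4)]
print(retrace_target(traj, retrace_c))
```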

If this operator is applied to TD($\lambda$), does the equation correspond to the forward view of TD($\lambda$)?
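
For example, if I take the on-policy case $\mu = \pi$ and set $c_s = \lambda$, the expression seems to reduce to $$\mathcal{R} Q(x, a) = Q(x, a)+\mathbb{E}_{\pi}\left[\sum_{t \geq 0}(\gamma \lambda)^{t}\left(r_{t}+\gamma \mathbb{E}_{\pi} Q\left(x_{t+1}, \cdot\right)-Q\left(x_{t}, a_{t}\right)\right)\right],$$ which looks like the $\lambda$-return written as a discounted sum of (expected) TD errors, i.e. the forward view. Is that reading correct?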

What is the difference between the trace $c_s$ and an eligibility trace?
