The pseudocode below is taken from Sutton and Barto's "Reinforcement Learning: An Introduction". It shows an actor-critic implementation with eligibility traces. My question is: if I set $\lambda^{\theta}=1$ and replace $\delta$ with the immediate reward $R_t$, do I get a backward-view implementation of REINFORCE?
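For context, here is a minimal sketch of the kind of per-step updates that algorithm performs. This is not the book's exact pseudocode: the toy chain environment, the tabular softmax policy, the linear value function, and names like `grad_log_pi` are all illustrative choices made here to keep the example self-contained.

```python
import numpy as np

n_states, n_actions = 3, 2
gamma, alpha_w, alpha_theta = 0.99, 0.1, 0.05
lam_w, lam_theta = 0.9, 0.9          # set lam_theta = 1.0 to probe the question's claim

w = np.zeros(n_states)                   # state-value weights, v(s) = w[s]
theta = np.zeros((n_states, n_actions))  # policy preferences, softmax over theta[s]

def policy(s, theta):
    prefs = theta[s]
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def grad_log_pi(s, a, theta):
    # gradient of log softmax w.r.t. theta for tabular (one-hot) features
    g = np.zeros_like(theta)
    p = policy(s, theta)
    g[s] = -p
    g[s, a] += 1.0
    return g

def step(s, a):
    # toy chain: action 1 moves right, action 0 stays; reward 1 on reaching the end
    s_next = min(s + a, n_states - 1)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

for episode in range(200):
    s, done = 0, False
    z_w = np.zeros_like(w)            # critic eligibility trace
    z_theta = np.zeros_like(theta)    # actor eligibility trace
    I = 1.0                           # discounting from the start state
    while not done:
        a = np.random.choice(n_actions, p=policy(s, theta))
        s_next, r, done = step(s, a)
        v_next = 0.0 if done else w[s_next]
        delta = r + gamma * v_next - w[s]              # TD error
        z_w = gamma * lam_w * z_w
        z_w[s] += 1.0                                  # += gradient of v(s) = w[s]
        z_theta = gamma * lam_theta * z_theta + I * grad_log_pi(s, a, theta)
        w += alpha_w * delta * z_w
        theta += alpha_theta * delta * z_theta
        I *= gamma
        s = s_next
```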
- Well, I think that if you set $\lambda^{\theta}$ to 1, the eligibility traces serve no purpose. In that case you would update $\theta$ directly by the policy gradient multiplied by the TD error $\delta$. – Georgy Firsov Jan 20 '21 at 14:33
- Well, not quite. $z^{\theta}$ will still hold $\nabla \ln \pi(A|S, \theta)$ terms from previous steps and decay them by $\gamma$, so past gradients keep being credited, in a decaying manner, as the episode goes on. Since eligibility traces implement a backward view of the $\lambda$-return, setting $\gamma=1$ as well should result in MC updates. Is there anything wrong with this reasoning? – Javier Ventajas Hernández Jan 20 '21 at 15:04
- Now that I think about it, setting both $\lambda^{\theta}=1$ and $\lambda^{w}=1$ should turn this method into REINFORCE with baseline, shouldn't it? – Javier Ventajas Hernández Jan 20 '21 at 15:07
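A brief sketch of the algebra behind that last comment, under two simplifying assumptions not stated in the comments: $\gamma=1$ (so the $I$ factor in the book's pseudocode is always 1) and $w$, $\theta$ held fixed while the per-step updates of one episode of length $T$ are summed. With $\lambda^{\theta}=1$ the actor trace unrolls to a running sum of score-function gradients, and swapping the order of summation gives

$$
z^{\theta}_t = \sum_{k=0}^{t} \nabla \ln \pi(A_k \mid S_k, \theta), \qquad
\sum_{t=0}^{T-1} \delta_t\, z^{\theta}_t
= \sum_{k=0}^{T-1} \nabla \ln \pi(A_k \mid S_k, \theta) \sum_{t=k}^{T-1} \delta_t
= \sum_{k=0}^{T-1} \bigl(G_k - \hat{v}(S_k, w)\bigr)\, \nabla \ln \pi(A_k \mid S_k, \theta),
$$

since $\sum_{t=k}^{T-1} \delta_t = \sum_{t=k}^{T-1} \bigl(R_{t+1} + \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)\bigr)$ telescopes to $G_k - \hat{v}(S_k, w)$ when the terminal state's value is defined as 0. That is the REINFORCE-with-baseline update, so the comment's reasoning looks right up to the caveat that in the actual algorithm the parameters change during the episode rather than only at its end.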