Questions tagged [eligibility-traces]

For questions related to the reinforcement learning technique called "eligibility traces", which combines temporal-difference and Monte Carlo methods.

15 questions
6
votes
1 answer

Can TD($\lambda$) be used with deep reinforcement learning?

TD($\lambda$) is a way to interpolate between TD(0), bootstrapping over a single step, and TD(max), bootstrapping over the entire episode length, i.e. Monte Carlo. Reading the link above, I see that an eligibility trace is kept for each state in order…
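For readers skimming this tag, here is a minimal tabular TD($\lambda$) prediction sketch with accumulating traces, to make the "one trace per state" idea concrete; the `env.reset()`/`env.step(a)` interface and `policy(s)` are hypothetical placeholders, not taken from any of the linked posts.

```python
import numpy as np

def td_lambda(env, policy, n_states, episodes=500,
              alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) prediction with accumulating eligibility traces."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)            # one eligibility trace per state
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)   # hypothetical (state, reward, done) interface
            delta = r + gamma * V[s_next] * (not done) - V[s]
            e[s] += 1.0                     # accumulate trace for the visited state
            V += alpha * delta * e          # every state updated in proportion to its trace
            e *= gamma * lam                # traces decay geometrically
            s = s_next
    return V
```

With `lam=0` this reduces to one-step TD(0); with `lam=1` the updates approach a Monte Carlo target.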
5
votes
1 answer

Why not more TD($\lambda$) in actor-critic algorithms?

Is there either an empirical or theoretical reason that actor-critic algorithms with eligibility traces have not been more fully explored? I was hoping to find a paper or implementation or both for continuous tasks (not episodic) in continuous…
4
votes
1 answer

How to apply or extend the $Q(\lambda)$ algorithm to semi-MDPs?

I want to model an SMDP such that time is discretized, the transition time between two states follows an exponential distribution, and there is no reward during the transition. What are the differences between $Q(\lambda)$…
3
votes
0 answers

How to implement REINFORCE with eligibility traces?

The pseudocode below is taken from Sutton and Barto's "Reinforcement Learning: An Introduction". It shows an actor-critic implementation with eligibility traces. My question is: if I set $\lambda^{\theta}=1$ and replace $\delta$ with the immediate…
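For context, here is a compact sketch of the kind of episodic actor-critic with eligibility traces that the book's pseudocode describes; the helpers `sample_action`, `v_hat`, `grad_v`, and `grad_log_pi` are hypothetical placeholders standing in for whatever differentiable actor and critic parameterizations are used.

```python
import numpy as np

def actor_critic_traces(env, sample_action, v_hat, grad_v, grad_log_pi,
                        theta, w, episodes=1000, gamma=0.99,
                        lam_theta=0.9, lam_w=0.9,
                        alpha_theta=1e-3, alpha_w=1e-2):
    """Episodic actor-critic with eligibility traces, after Sutton & Barto's
    pseudocode; helper functions are placeholders for the chosen models."""
    for _ in range(episodes):
        s = env.reset()
        z_theta = np.zeros_like(theta)   # actor trace
        z_w = np.zeros_like(w)           # critic trace
        I, done = 1.0, False
        while not done:
            a = sample_action(s, theta)
            s_next, r, done = env.step(a)
            v_next = 0.0 if done else v_hat(s_next, w)
            delta = r + gamma * v_next - v_hat(s, w)      # TD error
            z_w = gamma * lam_w * z_w + grad_v(s, w)
            z_theta = gamma * lam_theta * z_theta + I * grad_log_pi(s, a, theta)
            w = w + alpha_w * delta * z_w
            theta = theta + alpha_theta * delta * z_theta
            I *= gamma
            s = s_next
    return theta, w
```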
3
votes
0 answers

Why does weighting by lambda values that sum to 1 ensure convergence with eligibility traces?

In chapter 12 of Sutton and Barto's book, they state that if the weights sum to 1, then the equation's updates have "guaranteed convergence properties". Why does this actually ensure convergence? The full citation of the mentioned fragment is in Richard…
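For reference, the fact that the weights form a convex combination is just a geometric series (this is only the normalization step the book appeals to, not the convergence argument itself):
$$(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1} = (1-\lambda)\cdot\frac{1}{1-\lambda} = 1, \qquad 0 \le \lambda < 1,$$
so the $\lambda$-return is a weighted average of $n$-step returns rather than an unbounded sum.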
2
votes
1 answer

Do eligibility traces and epsilon-greedy do the same task in different ways?

I understand that, in reinforcement learning algorithms such as Q-learning, to prevent selecting the actions with the greatest Q-values too quickly and to allow for exploration, we use eligibility traces. Here are some questions: Does $\epsilon$-greedy solve…
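As a point of contrast for questions like this one, $\epsilon$-greedy is purely an action-selection rule, while eligibility traces only change how the TD error is propagated back to earlier states (see the TD($\lambda$) sketch above); a minimal $\epsilon$-greedy sketch with illustrative names:

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1, rng=None):
    """With probability eps pick a uniformly random action (exploration),
    otherwise pick the action with the largest estimated Q-value."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))
```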
2
votes
0 answers

Watkins' $Q(\lambda)$ with function approximation: why is the gradient not considered when updating eligibility traces for the exploitation phase?

I'm implementing Watkins' $Q(\lambda)$ algorithm with function approximation (from the 2nd edition of Sutton & Barto). I am very confused about updating the eligibility traces because, at the beginning of chapter 9.3 "Control with Function Approximation",…
2
votes
1 answer

How to prove the formula for the eligibility trace operator in reinforcement learning?

I don't understand how the formula in the red circle is derived. The screenshot is taken from this paper.
2
votes
1 answer

How do I derive the gradient with respect to the parameters of the softmax policy?

The gradient of the softmax eligibility trace is given by the following: \begin{align} \nabla_{\theta} \log(\pi_{\theta}(a|s)) &= \phi(s,a) - \mathbb E[\phi (s, \cdot)]\\ &= \phi(s,a) - \sum_{a'} \pi(a'|s) \phi(s,a') \end{align} How is this equation…
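For completeness, assuming linear action preferences $h(s,a) = \theta^{\top}\phi(s,a)$ and the softmax policy $\pi_{\theta}(a|s) = e^{h(s,a)} / \sum_{b} e^{h(s,b)}$ (the standard softmax-in-action-preferences setting), the identity follows by differentiating the log:
$$\log \pi_{\theta}(a|s) = \theta^{\top}\phi(s,a) - \log\sum_{b} e^{\theta^{\top}\phi(s,b)}, \qquad \nabla_{\theta}\log \pi_{\theta}(a|s) = \phi(s,a) - \frac{\sum_{b} e^{\theta^{\top}\phi(s,b)}\,\phi(s,b)}{\sum_{b'} e^{\theta^{\top}\phi(s,b')}} = \phi(s,a) - \sum_{b}\pi_{\theta}(b|s)\,\phi(s,b).$$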
2
votes
1 answer

How can the $\lambda$-return be defined recursively?

The $\lambda$-return is defined as $$G_t^\lambda = (1-\lambda)\sum_{n=1}^\infty \lambda^{n-1}G_{t:t+n}$$ where $$G_{t:t+n} = R_{t+1}+\gamma R_{t+2}+\dots +\gamma^{n-1}R_{t+n} + \gamma^n\hat{v}(S_{t+n})$$ is the $n$-step return from time $t$. How can…
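For reference, splitting the $n=1$ term out of the sum and re-indexing the remainder yields the standard recursive form (a well-known identity from the same chapter):
$$G_t^{\lambda} = R_{t+1} + \gamma\Big[(1-\lambda)\,\hat{v}(S_{t+1}) + \lambda\, G_{t+1}^{\lambda}\Big].$$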
1
vote
0 answers

What is 'eligibility' in intuitive terms in TD($\lambda$) learning?

I am watching the lecture from Brown University (on Udemy), and I am at the portion on Temporal Difference Learning. In the pseudocode/algorithm of TD(1) (seen in the screenshot below), we initialise the eligibility $e(s) = 0$ for all states. Later…
1
vote
1 answer

Applying eligibility traces to a Q-learning algorithm does not improve results (and might not function well)

I am trying to apply eligibility traces to a currently working Q-learning algorithm. The reference code for the Q-learning algorithm was taken from this great blog by DeepLizard, but it does not include eligibility traces. Link to the code on Google…
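For situations like this, a common reference point is tabular Watkins' $Q(\lambda)$, which adds traces to Q-learning but cuts them after exploratory actions; below is a minimal sketch, assuming a hypothetical `env` with `reset()`/`step(a)` returning `(next_state, reward, done)`, not the code from the linked blog.

```python
import numpy as np

def watkins_q_lambda(env, n_states, n_actions, episodes=500,
                     alpha=0.1, gamma=0.99, lam=0.9, eps=0.1, seed=0):
    """Tabular Watkins' Q(lambda): Q-learning plus eligibility traces that are
    zeroed whenever an exploratory (non-greedy) action is taken."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        e = np.zeros_like(Q)
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            a_star = int(np.argmax(Q[s_next]))          # greedy action at s_next
            if Q[s_next, a_next] == Q[s_next, a_star]:
                a_star = a_next                          # break ties in favour of a_next
            target = 0.0 if done else gamma * Q[s_next, a_star]
            delta = r + target - Q[s, a]
            e[s, a] += 1.0                               # accumulating trace
            Q += alpha * delta * e
            if a_next == a_star:
                e *= gamma * lam                         # decay traces after a greedy action
            else:
                e[:] = 0.0                               # cut traces after an exploratory action
            s, a = s_next, a_next
    return Q
```

Cutting the traces on exploratory actions is what keeps the backups consistent with the greedy target policy, and it is also one reason traces sometimes give little visible benefit when exploration is frequent.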
1
vote
0 answers

How is the general return-based off-policy equation derived?

I'm wondering how the general return-based off-policy equation in Safe and efficient off-policy reinforcement learning is derived: $$\mathcal{R} Q(x, a):=Q(x, a)+\mathbb{E}_{\mu}\left[\sum_{t \geq 0} \gamma^{t}\left(\prod_{s=1}^{t}…
1
vote
0 answers

Eligibility traces in model-based reinforcement learning

In model-based reinforcement learning algorithms, a model of the environment is constructed in order to use samples efficiently, as in methods such as Dyna and Prioritized Sweeping. Moreover, eligibility traces help in learning (action) value functions…
0
votes
1 answer

How to deal with delay in reinforcement learning, an unclear case

According to the question How to deal with the time delay in reinforcement learning?, delays in reinforcement learning can be observation delays, action delays, or reward delays. I have a special case of delay, but I am not…