Questions tagged [td-lambda]
For questions related to the TD($\lambda$) family of algorithms.
12 questions
9
votes
2 answers
What is the intuition behind TD($\lambda$)?
I'd like to better understand temporal-difference learning. In particular, I'm wondering if it is prudent to think about TD($\lambda$) as a type of "truncated" Monte Carlo learning?

Nick Kunz
- 145
- 1
- 5
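For orientation (this is the standard forward-view definition, not quoted from the question): the $\lambda$-return is a geometrically weighted mixture of $n$-step returns,
$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n},$$
so $\lambda = 0$ reduces to the one-step TD(0) target and $\lambda \to 1$ to the full Monte Carlo return, which is where the "between TD and Monte Carlo" intuition comes from.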
8
votes
2 answers
Why are lambda returns so rarely used in policy gradients?
I've seen the Monte Carlo return $G_{t}$ being used in REINFORCE and the TD($0$) target $r_t + \gamma Q(s', a')$ in vanilla actor-critic. However, I've never seen someone use the lambda return $G^{\lambda}_{t}$ in these situations, nor in any other…

jhinGhin
- 83
- 3
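As a point of reference (a sketch of the substitution the question describes, not an algorithm named in it), using the $\lambda$-return in a REINFORCE-style gradient would simply replace $G_t$:
$$\nabla_\theta J(\theta) \approx \mathbb{E}\big[\, G_t^{\lambda} \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \big].$$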
6
votes
1 answer
Can TD($\lambda$) be used with deep reinforcement learning?
TD($\lambda$) is a way to interpolate between TD(0), which bootstraps over a single step, and TD(max), which bootstraps over the entire episode length, i.e., Monte Carlo.
Reading the link above, I see that an eligibility trace is kept for each state in order…

Gulzar
- 729
- 1
- 8
- 23
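For reference, a minimal tabular sketch of TD($\lambda$) with accumulating eligibility traces; the Gym-style environment interface, variable names, and hyperparameters are illustrative assumptions, not taken from the question:

```python
import numpy as np

def td_lambda_episode(env, V, alpha=0.1, gamma=0.99, lam=0.9):
    """Run one episode of tabular TD(lambda) with accumulating eligibility traces.

    Assumes a Gym-style environment with integer states: env.reset() -> s and
    env.step(a) -> (s_next, reward, done, info); V is a 1-D numpy array of
    state-value estimates.
    """
    e = np.zeros_like(V)                       # one eligibility trace per state
    s = env.reset()
    done = False
    while not done:
        a = env.action_space.sample()          # placeholder behaviour policy
        s_next, r, done, _ = env.step(a)
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]                  # one-step TD error
        e[s] += 1.0                            # accumulate trace for the visited state
        V += alpha * delta * e                 # credit every recently visited state
        e *= gamma * lam                       # decay all traces
        s = s_next
    return V
```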
5
votes
1 answer
Why not more TD($\lambda$) in actor-critic algorithms?
Is there either an empirical or theoretical reason that actor-critic algorithms with eligibility traces have not been more fully explored? I was hoping to find a paper or implementation or both for continuous tasks (not episodic) in continuous…

Nick Kunz
- 145
- 1
- 5
5
votes
2 answers
Why am I getting the incorrect value of lambda?
I am trying to solve for $\lambda$ using temporal-difference learning. More specifically, I am trying to figure out what $\lambda$ I need such that $\text{TD}(\lambda)=\text{TD}(1)$ after one iteration. But I get the incorrect value of…

Amanda
- 205
- 1
- 5
2
votes
0 answers
How does bootstrapping work with the offline $\lambda$-return algorithm?
In Barto and Sutton's book, Reinforcement Learning: An Introduction (2nd edition), equation 12.2 on page 289 introduces the $\lambda$-return, defined as follows
$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}…

quest ions
- 384
- 1
- 8
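For context, the episodic form of the same definition (equation 12.3 in the second edition, as I recall it) makes the terminal bootstrapping explicit by collecting all remaining weight onto the conventional return:
$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t.$$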
2
votes
0 answers
Why is TD(0) not converging to the optimal policy?
I am trying to implement the basic RL algorithms to learn on this 10x10 GridWorld (from REINFORCEJS by Karpathy).
Currently I am stuck at TD(0). No matter how many episodes I run, when I am updating the policy after all episodes are done according…

PeeteKeesel
- 121
- 3
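For reference, the tabular TD(0) prediction update the question refers to, as a minimal sketch (variable names are illustrative):

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.9):
    """One tabular TD(0) prediction update:
    V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V
```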
2
votes
1 answer
How is $\Delta$ updated in true online TD($\lambda$)?
In section 7.4 of the RL textbook by Sutton & Barto, the authors discuss "true online TD($\lambda$)". The figure below (7.10 in the book) shows the algorithm.
At the end of each step, $V_{old} \leftarrow V(S')$ and also $S \leftarrow S'$. When…

roy
- 53
- 3
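For context, the core updates of true online TD($\lambda$) with linear function approximation, as I recall them from the published second edition (the notation here is my paraphrase, not the question's figure): with $V = \mathbf{w}^\top \mathbf{x}$, $V' = \mathbf{w}^\top \mathbf{x}'$ and $\delta = R + \gamma V' - V$,
$$\mathbf{z} \leftarrow \gamma\lambda\,\mathbf{z} + \big(1 - \alpha\gamma\lambda\,\mathbf{z}^\top\mathbf{x}\big)\,\mathbf{x}, \qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha\big(\delta + V - V_{old}\big)\,\mathbf{z} - \alpha\big(V - V_{old}\big)\,\mathbf{x},$$
after which $V_{old} \leftarrow V'$ and $\mathbf{x} \leftarrow \mathbf{x}'$, matching the two assignments the question quotes.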
1
vote
0 answers
When do you back-propagate errors through a neural network when using TD($\lambda$)?
I have a neural network that I want to use to self-play Connect Four. The neural network receives the board state and is to provide an estimate of the state's value.
I would then, for each move, use the highest estimate; occasionally I will use…

NeomerArcana
- 210
- 3
- 12
1
vote
0 answers
What is 'eligibility' in intuitive terms in TD($\lambda$) learning?
I am watching the lecture from Brown University (on Udemy) and I am at the portion on Temporal Difference Learning.
In the pseudocode/algorithm of TD(1) (seen in the screenshot below), we initialise the eligibility $e(s) = 0$ for all states. Later…

cgo
- 175
- 5
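In intuitive terms (a standard restatement, not quoted from the lecture): a state's eligibility records how recently and how frequently it was visited, and each new TD error is credited back to every state in proportion to its current eligibility. With accumulating traces the per-step update is
$$e(s) \leftarrow \gamma\lambda\, e(s) \ \text{ for all } s, \qquad e(S_t) \leftarrow e(S_t) + 1,$$
so a trace decays geometrically after a visit and is bumped up each time the state is revisited.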
1
vote
0 answers
How is the general return-based off-policy equation derived?
I'm wondering how the general return-based off-policy equation in Safe and efficient off-policy reinforcement learning is derived
$$\mathcal{R} Q(x, a):=Q(x, a)+\mathbb{E}_{\mu}\left[\sum_{t \geq 0} \gamma^{t}\left(\prod_{s=1}^{t}…

fish_tree
- 247
- 1
- 6
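For context, the operator in question (from Munos et al., 2016, reproduced here from memory, so treat the details with care) applies an off-policy correction with generic trace coefficients $c_s$:
$$\mathcal{R} Q(x, a) := Q(x, a) + \mathbb{E}_{\mu}\left[\sum_{t \geq 0} \gamma^{t}\left(\prod_{s=1}^{t} c_{s}\right)\left(r_{t} + \gamma\, \mathbb{E}_{\pi} Q(x_{t+1}, \cdot) - Q(x_{t}, a_{t})\right)\right],$$
where $\mathbb{E}_{\pi} Q(x, \cdot) = \sum_{a} \pi(a \mid x)\, Q(x, a)$; particular choices of $c_s$ (e.g. $c_s = \lambda \min\!\left(1, \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}\right)$ for Retrace($\lambda$)) recover the different algorithms analysed in the paper.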
0
votes
1 answer
When using TD(λ), how do you calculate the eligibility trace per input & weight of a neural network neuron?
I have a neural network; each neuron is made up of inputs, weights, and an output. I have potentially multiple hidden layers. The activation function applied to the output is not known by the neuron.
I would like to use TD(λ) to back-propagate…

NeomerArcana
- 210
- 3
- 12
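One common approach, sketched here under assumptions (semi-gradient TD($\lambda$) with one eligibility trace per network parameter; value_net, traces, and the PyTorch-style interface are illustrative, not from the question):

```python
import torch

def semi_gradient_td_lambda_step(value_net, traces, s, r, s_next, done,
                                 alpha=1e-3, gamma=0.99, lam=0.8):
    """One semi-gradient TD(lambda) step with one eligibility trace per weight.

    `value_net` maps a state tensor to a scalar value; `traces` is a list of
    tensors shaped like value_net.parameters(), initialised to zeros.
    """
    v_s = value_net(s).squeeze()               # V(s; w), keeps the graph for backprop
    with torch.no_grad():
        v_next = torch.zeros(()) if done else value_net(s_next).squeeze()
        delta = r + gamma * v_next - v_s       # TD error; target is not differentiated

    value_net.zero_grad()
    v_s.backward()                             # p.grad now holds dV(s; w)/dw for each weight

    with torch.no_grad():
        for p, z in zip(value_net.parameters(), traces):
            z.mul_(gamma * lam).add_(p.grad)   # z <- gamma*lam*z + dV/dw
            p.add_(alpha * delta * z)          # w <- w + alpha * delta * z
```

In this sketch the traces would typically be reset at the start of each episode, e.g. `traces = [torch.zeros_like(p) for p in value_net.parameters()]`.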