Questions tagged [td-lambda]

For questions related to the TD($\lambda$) family of algorithms.

12 questions
9
votes
2 answers

What is the intuition behind TD($\lambda$)?

I'd like to better understand temporal-difference learning. In particular, I'm wondering if it is prudent to think about TD($\lambda$) as a type of "truncated" Monte Carlo learning?
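For reference, the $\lambda$-return behind TD($\lambda$) is a geometrically weighted average of $n$-step returns; $\lambda = 0$ recovers the one-step TD(0) target and $\lambda \to 1$ recovers the Monte Carlo return:

$$G_t^{\lambda} \;=\; (1-\lambda)\sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_{t:t+n}, \qquad G_{t:t+n} \;=\; R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$$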
8
votes
2 answers

Why are lambda returns so rarely used in policy gradients?

I've seen the Monte Carlo return $G_{t}$ being used in REINFORCE and the TD($0$) target $r_t + \gamma Q(s', a')$ in vanilla actor-critic. However, I've never seen anyone use the lambda return $G^{\lambda}_{t}$ in these situations, nor in any other…
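A minimal sketch of how a lambda return could be computed as an actor-critic target, assuming a recorded trajectory of per-step rewards and critic estimates of the next states; all names and hyperparameters here are illustrative, not from any particular implementation:

    import numpy as np

    def lambda_returns(rewards, next_values, gamma=0.99, lam=0.95):
        """Backward recursion for the lambda-return
        G_t = r_{t+1} + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
        bootstrapping the tail of the trajectory with V of the last state.

        rewards[t]     -- reward received after the action at step t
        next_values[t] -- critic estimate V(s_{t+1}); use 0.0 for terminal states
        """
        G = np.zeros(len(rewards))
        future_return = next_values[-1]
        for t in reversed(range(len(rewards))):
            G[t] = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * future_return)
            future_return = G[t]
        return G

With $\lambda = 0$ this reduces to the TD(0) target and with $\lambda = 1$ to the (bootstrapped-at-the-end) Monte Carlo return, so it can drop into the same place as $G_t$ or the one-step target when forming policy-gradient advantages.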
6
votes
1 answer

Can TD($\lambda$) be used with deep reinforcement learning?

TD($\lambda$) is a way to interpolate between TD(0), which bootstraps over a single step, and TD(max), which bootstraps over the entire episode, i.e. Monte Carlo. Reading the link above, I see that an eligibility trace is kept for each state in order…
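A minimal sketch of how the trace moves from "one per state" to "one per weight" under function approximation, shown here for a linear value function; with a deep network the feature vector $x$ would be replaced by the gradient of $V$ with respect to the weights. Names and hyperparameters are illustrative:

    import numpy as np

    def semi_gradient_td_lambda(features, rewards, alpha=0.01, gamma=0.99, lam=0.9):
        """Semi-gradient TD(lambda) over one episode, linear value function v(s) = w . x(s).

        features[t] = x(s_t) for t = 0..T (the last state is treated as terminal),
        rewards[t]  = reward received on the transition s_t -> s_{t+1}.
        The eligibility trace z has one entry per weight, not one per state.
        """
        w = np.zeros(features.shape[1])
        z = np.zeros(features.shape[1])
        for t in range(len(rewards)):
            x = features[t]
            v = w @ x
            v_next = 0.0 if t == len(rewards) - 1 else w @ features[t + 1]
            delta = rewards[t] + gamma * v_next - v
            z = gamma * lam * z + x      # for a neural net: z = gamma*lam*z + grad_w V(s_t)
            w = w + alpha * delta * z
        return w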
5
votes
1 answer

Why not more TD($\lambda$) in actor-critic algorithms?

Is there either an empirical or theoretical reason that actor-critic algorithms with eligibility traces have not been more fully explored? I was hoping to find a paper or implementation or both for continuous tasks (not episodic) in continuous…
5
votes
2 answers

Why am I getting the incorrect value of lambda?

I am trying to solve for $\lambda$ using temporal-difference learning. More specifically, I am trying to figure out what $\lambda$ I need, such that $\text{TD}(\lambda)=\text{TD}(1)$, after one iteration. But I get the incorrect value of…
2
votes
0 answers

How does bootstrapping work with the offline $\lambda$-return algorithm?

In Sutton and Barto's book, Reinforcement Learning: An Introduction (2nd edition), equation 12.2 on page 289 introduces the $\lambda$-return, defined as follows $$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}…
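For reference, the episodic form of that definition makes the bootstrapping explicit: each $n$-step return $G_{t:t+n}$ bootstraps from $V(S_{t+n})$, and for an episode ending at time $T$ the remaining weight collapses onto the full Monte Carlo return $G_t$:

$$G_t^{\lambda} \;=\; (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{\,n-1}\, G_{t:t+n} \;+\; \lambda^{\,T-t-1}\, G_t$$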
2
votes
0 answers

Why is TD(0) not converging to the optimal policy?

I am trying to implement the basic RL algorithms to learn on this 10x10 GridWorld (from REINFORCEjs by Karpathy). Currently I am stuck at TD(0). No matter how many episodes I run, when I am updating the policy after all episodes are done according…
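For comparison, a minimal tabular TD(0) update (names are illustrative). Note that TD(0) on its own is a prediction method: it evaluates the policy that generated the data rather than searching for an optimal policy the way Q-learning or SARSA do:

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95, terminal=False):
        """One tabular TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
        target = r + (0.0 if terminal else gamma * V[s_next])
        V[s] += alpha * (target - V[s])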
2
votes
1 answer

How is $\Delta$ updated in true online TD($\lambda$)?

In section 7.4 of the RL textbook by Sutton & Barto, the authors discuss "True online TD($\lambda$)". The figure (7.10 in the book) below shows the algorithm. At the end of each step, $V_{old} \leftarrow V(S')$ and also $S \leftarrow S'$. When…
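A minimal sketch of one step of true online TD($\lambda$) with linear features, following the book's pseudocode, which shows where $V_{old}$ is refreshed; function and variable names are illustrative:

    import numpy as np

    def true_online_td_step(w, z, x, x_next, r, v_old, alpha=0.01, gamma=0.99, lam=0.9):
        """One step of true online TD(lambda) with a linear value function w . x.
        x_next should be the zero vector if S' is terminal."""
        v = w @ x
        v_next = w @ x_next
        delta = r + gamma * v_next - v
        z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x   # dutch-style trace
        w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
        v_old = v_next     # this is the V_old <- V(S') step, done before S <- S', x <- x'
        return w, z, v_old

    # At the start of each episode: w carries over, z = np.zeros(d), v_old = 0.0.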
1
vote
0 answers

When do you back-propagate errors through a neural network when using TD($\lambda$)?

I have a neural network that I want to use for self-play in Connect Four. The neural network receives the board state and is to provide an estimate of the state's value. I would then, for each move, use the highest estimate; occasionally I will use…
1
vote
0 answers

What is 'eligibility' in intuitive terms in TD($\lambda$) learning?

I am watching the lectures from Brown University (on Udemy) and I am in the Temporal Difference Learning portion. In the pseudocode/algorithm of TD(1) (seen in the screenshot below), we initialise the eligibility $e(s) = 0$ for all states. Later…
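One way to see the trace concretely is a tabular sketch (illustrative names): the trace of a state jumps when that state is visited and decays by $\gamma\lambda$ afterwards, so the current TD error also credits states visited earlier in the episode, in proportion to how recently they were seen:

    def td_lambda_tabular_step(V, e, s, r, s_next, alpha=0.1, gamma=0.95, lam=0.8):
        """Backward-view TD(lambda) with accumulating traces. e[s] measures how
        recently and how often state s was visited."""
        delta = r + gamma * V[s_next] - V[s]
        e[s] += 1.0                          # the state just visited becomes 'eligible'
        for state in list(e):
            V[state] += alpha * delta * e[state]
            e[state] *= gamma * lam          # every trace decays towards zero

Here V and e can both be collections.defaultdict(float), with e reset to zero at the start of each episode.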
1
vote
0 answers

How is the general return-based off-policy equation derived?

I'm wondering how the general return-based off-policy equation in Safe and efficient off-policy reinforcement learning is derived: $$\mathcal{R} Q(x, a):=Q(x, a)+\mathbb{E}_{\mu}\left[\sum_{t \geq 0} \gamma^{t}\left(\prod_{s=1}^{t}…
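For context, the operator being asked about is, as far as I recall from that paper,

$$\mathcal{R} Q(x, a) := Q(x, a) + \mathbb{E}_{\mu}\left[\sum_{t \geq 0} \gamma^{t}\left(\prod_{s=1}^{t} c_{s}\right)\left(r_{t} + \gamma \, \mathbb{E}_{\pi} Q(x_{t+1}, \cdot) - Q(x_{t}, a_{t})\right)\right]$$

where $\mathbb{E}_{\pi} Q(x, \cdot) = \sum_{a} \pi(a \mid x) Q(x, a)$ and the $c_{s}$ are per-step trace coefficients, e.g. $c_{s} = \lambda \min\!\left(1, \frac{\pi(a_{s} \mid x_{s})}{\mu(a_{s} \mid x_{s})}\right)$ for Retrace($\lambda$).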
0
votes
1 answer

When using TD(λ), how do you calculate the eligibility trace per input & weight of a neural network neuron?

I have a neural network; each neuron is made up of inputs, weights, and an output. I have potentially multiple hidden layers. The activation function applied to the output is not known by the neuron. I would like to use TD(λ) to back-propagate…
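A minimal sketch using PyTorch autograd, assuming a hypothetical value network with 42 inputs (a flattened 6x7 board); the point is that there is one trace tensor per weight tensor, updated as $z \leftarrow \gamma\lambda z + \partial V/\partial w$, after which every weight moves by $\alpha\delta z$. The network, names, and hyperparameters are illustrative:

    import torch
    import torch.nn as nn

    # Hypothetical value network; state and next_state are 1-D float tensors of length 42.
    net = nn.Sequential(nn.Linear(42, 64), nn.Tanh(), nn.Linear(64, 1))
    traces = [torch.zeros_like(p) for p in net.parameters()]   # one trace per weight tensor
    alpha, gamma, lam = 0.01, 0.99, 0.8

    def td_lambda_step(state, reward, next_state, terminal):
        """One TD(lambda) step: z <- gamma*lam*z + dV/dw for every weight,
        then w <- w + alpha * delta * z."""
        v = net(state).squeeze()
        with torch.no_grad():
            v_next = torch.tensor(0.0) if terminal else net(next_state).squeeze()
            delta = reward + gamma * v_next - v
        net.zero_grad()
        v.backward()                         # p.grad now holds dV(state)/dp
        with torch.no_grad():
            for p, z in zip(net.parameters(), traces):
                z.mul_(gamma * lam).add_(p.grad)     # decay trace, add current gradient
                p.add_(alpha * delta.item() * z)     # semi-gradient TD(lambda) update

The traces would be reset to zero at the start of each game, and the backward pass is run at every step so the gradient of the current value estimate can be folded into the traces before the TD error is applied.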