
There are two types of value functions in reinforcement learning: the state value function $V^{\pi}(s)$ and the state-action value function $Q^{\pi}(s, a)$.

State value function:

This value tells us how good it is to be in state $s$ if we are following policy $\pi$. Formally, it can be defined as the expected return obtained from time step $t$ onwards, starting from state $s$ and following policy $\pi$.

$$V^{\pi}(s) = \mathbb{E}_{\pi}[R_{t}|s_t = s] = \mathbb{E}_{\pi} \left[ \sum \limits_{k=0}^{\infty} \gamma^{k}r_{t+k+1} \mid s_t = s\right]$$

State-action value function:

This value tells us how good it is to perform action $a$ in state $s$ if we are following policy $\pi$. Formally, it can be defined as the expected return obtained from time step $t$ onwards, starting from state $s$, taking action $a$, and following policy $\pi$ thereafter.

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[R_{t}|s_t = s, a_t = a] = \mathbb{E}_{\pi} \left[ \sum \limits_{k=0}^{\infty} \gamma^{k}r_{t+k+1} \mid s_t = s, a_t = a\right]$$
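For concreteness, here is a minimal sketch of how I read these two definitions operationally, i.e. as averages of sampled discounted returns (Monte Carlo estimates). The gym-style environment `env`, the function `policy(state)`, and the helper `env.step_from(state, action)` are placeholders I made up for illustration, not part of any specific library.

```python
import numpy as np

def rollout_return(env, policy, state, first_action=None, gamma=0.99, horizon=1000):
    """Discounted return of one episode started from `state`.

    If `first_action` is given, it is taken first (for Q^pi); all later
    actions come from `policy`.
    """
    g, discount = 0.0, 1.0
    s = state
    a = first_action if first_action is not None else policy(s)
    for _ in range(horizon):
        s, r, done = env.step_from(s, a)  # hypothetical helper: one step from an arbitrary state
        g += discount * r
        discount *= gamma
        if done:
            break
        a = policy(s)
    return g

def mc_state_value(env, policy, state, n_episodes=500, **kw):
    # V^pi(s): average return when every action follows pi
    return np.mean([rollout_return(env, policy, state, **kw) for _ in range(n_episodes)])

def mc_action_value(env, policy, state, action, n_episodes=500, **kw):
    # Q^pi(s, a): take `action` first, then follow pi
    return np.mean([rollout_return(env, policy, state, first_action=action, **kw)
                    for _ in range(n_episodes)])
```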

Now, the Q-learning and SARSA algorithms are generally used to update the $Q$ function under policy $\pi$, using the following recurrences respectively:

$$Q(s_t,a_t) = Q(s_t,a_t) + \alpha[r_{t+1} + \gamma \max\limits_{a} Q(s_{t+1},a) - Q(s_t,a_t)] $$

$$Q(s_t,a_t) = Q(s_t,a_t) + \alpha[r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)] $$
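In tabular form, a minimal sketch of these two updates might look like this ($Q$ stored as a NumPy array indexed by discrete state and action; the function names are mine, not from any library):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap with the greedy value max_a' Q(s', a')
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap with the value of the action a' actually taken next
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```

The only difference between the two is the bootstrap term: the greedy maximum for Q-learning versus the next action actually selected for SARSA.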

My question is about the recurrence relations in Temporal Difference (TD) algorithms that update state value functions. Are they the same as the recurrences provided above?

$$V(s_t) = V(s_t) + \alpha[r_{t+1} + \gamma \max V(s_{t+1}) - V(s_t)] $$

$$V(s_t) = V(s_t) + \alpha[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)] $$

If yes, what are the names of the algorithms that use these recurrences?
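To make clear what I mean, this is how I currently picture the second of these state-value recurrences in a tabular setting (just a sketch with names I made up, updating $V$ from a single transition observed while following the policy):

```python
def state_value_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # V is an array (e.g. a NumPy array) indexed by discrete state
    td_target = r + gamma * V[s_next]   # bootstrapped one-step target
    V[s] += alpha * (td_target - V[s])  # move V(s) toward the target
```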

hanugm
    Q-learning and SARSA are TD learning algorithms; so, no, TD is not limited to algorithms that update the state value function. Q-learning and SARSA are also _control_ algorithms, so they find policies (i.e. [controllers](https://ai.stackexchange.com/a/23427/2444)). They are also TD(0) algorithms. To have an idea of what TD-lambda is, I think my answer [here](https://ai.stackexchange.com/q/17605/2444) should address it. Let me know if I should close this post as a duplicate of that one. – nbro Aug 10 '21 at 00:30
    In any case, it seems to me that you're asking multiple questions here. I'd suggest that you ask only 1, and clarify how your question is different from others. In any case, to fully understand TD-lambda, you need to understand TD learning well first, then Q-learning. I would suggest that you read the related chapter of Sutton & Barto. It may take some time to get used to all this. – nbro Aug 10 '21 at 00:34
  • @nbro Now, I am in need of only recurrence equations. That is the reason for asking. I will try to edit. – hanugm Aug 10 '21 at 00:42
  • @nbro I will surely read that book. But for now, I want the recurrences only. I am hoping that I will read in detail further. – hanugm Aug 10 '21 at 00:52
  • There is repetition in your breakdown of $Q^\pi$ and it doesn't seem to go anywhere - is there something missing or a mistake there? Also your analysis of $V^\pi$ is wrong, you add a conditional $a_t=a$ from out of nowhere. Did you intend to write those equations as they are now, or have you copied them from some reference (the reference is plain wrong for $V^\pi$ if so)? – Neil Slater Aug 10 '21 at 07:27
    Ok, if you're interested only in the recursive relation of TD($\lambda$) (so focus on one algorithm at a time), please, edit your post to ask **only** that. If you're interested in knowing if there is a counter-part of the recursive relations of Q-learning and SARSA, please, edit your post to ask **only** that. If you have any other question, again, edit your post to leave just that question. If you have multiple questions, it's better to split the post into multiple ones, one for each question. It really seems to me that you're asking multiple questions here, although they are related. – nbro Aug 11 '21 at 12:20
  • Thanks! Please, make also sure that the question in the title matches the one in the body of the post. – nbro Aug 11 '21 at 12:42

0 Answers