
In Sutton & Barto's Reinforcement Learning: An Introduction (page 63), the authors introduce the optimal state-value function inside the expression of the optimal action-value function as follows: $q_{*}(s,a)=\mathbb{E}[R_{t+1}+\gamma v_{*}(S_{t+1})|S_{t}=s, A_{t}=a], \forall s \in S, \forall a \in A$.

I don't understand what $v_{*}(S_{t+1})$ could possibly mean, since $v_{*}$ is a mapping, under the optimal policy $\pi_{*}$, from states to numbers: the expected returns starting from those states.

I believe that the authors use the same notation to denote both the state-value function $v$, which satisfies $v(s)=\mathbb{E}[G_{t}|S_{t}=s], \forall s \in S$, and the random variable $\mathbb{E}[G_{t+1}|S_{t+1}]$, but I'm not sure.

Daviiid

1 Answer


I am not sure if it is standard notation, but Sutton & Barto use a convention under which a function of a random variable is a new random variable: it maps values of the old variable to values of the new one via the function, without affecting the probability distribution (except that the function may be many-to-one, in which case probabilities effectively combine, e.g. if there were several states with $v_*(s) = 5$).

Given this convention, $v_*(S_{t+1})$ is a random variable over the optimal state values of the possible states at time step $t+1$. That is, it has the same probabilities, determined by the policy and the state transition rules, as $S_{t+1}$, but takes the value $v_*(s)$ for each possible state $s$.

The actual distribution of $v_{*}(S_{t+1})$ will vary depending on the conditioning in the context where it is evaluated.
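
To make the convention concrete, here is a minimal Python sketch with made-up probabilities, state names and values (none of the numbers come from the book). It builds the distribution of $v_*(S_{t+1})$ as a pushforward of the distribution of $S_{t+1}$, including the case where two states share the same value:

```python
# Minimal sketch of the "function of a random variable" convention.
# p_next and v_star are illustrative assumptions, not from the book.
from collections import defaultdict

p_next = {"s1": 0.5, "s2": 0.3, "s3": 0.2}  # P[S_{t+1} = s]
v_star = {"s1": 5.0, "s2": 5.0, "s3": 1.0}  # v_*(s) for each state

# The new random variable v_*(S_{t+1}) keeps the probabilities of
# S_{t+1}; states sharing the same value have their probabilities combine.
dist = defaultdict(float)
for s, p in p_next.items():
    dist[v_star[s]] += p

print(dict(dist))                           # {5.0: 0.8, 1.0: 0.2}
print(sum(v * p for v, p in dist.items())) # E[v_*(S_{t+1})], approx. 4.2
```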

If you resolve the expectation in the first equation, which conditions on $S_t$ and $A_t$:

$q_{*}(s,a)=\mathbb{E}[R_{t+1}+\gamma v_{*}(S_{t+1})|S_{t}=s, A_{t}=a]$

$\qquad\quad= \sum_{r,s'} p(r,s'|s,a)(r + \gamma v_*(s'))$

... which expresses the action value $q_*(s,a)$ in terms of the state transition rules, the immediate reward function, and the state value $v_*(s')$ one half-step ahead. That is, at the next state, but before the next (optimally chosen) action is taken.
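
As a quick sanity check of that sum, here is a toy Python example with made-up transition probabilities, rewards and state values for one fixed $(s,a)$ pair (all numbers are illustrative assumptions, not from the book):

```python
# Toy evaluation of q_*(s,a) = sum over (r, s') of p(r,s'|s,a) * (r + gamma * v_*(s')),
# with hypothetical transition probabilities and state values.
gamma = 0.9

# p[(s_next, r)] = p(r, s' | s, a) for one fixed (s, a) pair
p = {("s1", 1.0): 0.6, ("s2", 0.0): 0.3, ("s2", -1.0): 0.1}
v_star = {"s1": 5.0, "s2": 1.0}

q = sum(prob * (r + gamma * v_star[s_next])
        for (s_next, r), prob in p.items())
print(round(q, 2))  # 0.6*(1+4.5) + 0.3*(0+0.9) + 0.1*(-1+0.9) = 3.56
```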

Neil Slater
  • I'm not sure I understood well, so let me please ask a question in general terms. If $X$ is a random variable from $\Omega$ to $\mathbb{R}$ and $f$ is a function on $\mathbb{R}$ then according to the convention, $f(X)$ is a random variable from $\Omega$ to $\mathbb{R}$ and the values of this random variable are $f(x)$ for $x$ in $\Omega$ but $P[f(X)=f(x)]=P[X=x]$ right? – Daviiid Jun 14 '21 at 20:55
  • @Daviiid Yes I think that has correctly put my first two paragraphs into notation. Sort of, I think you have some typos, i.e. $X$ is a random variable in $\Omega$ (the "to $\mathbb{R}$" after that looks like a typo), and $f(x): \Omega \rightarrow \mathbb{R}$ – Neil Slater Jun 14 '21 at 20:57
  • Thank you for your answer. May I ask if there is a particular reason for going with this convention? I think we can still get the same formula with the usual treatment of functions of random variables: $\mathbb{E}[v_{*}(S_{t+1})|S_{t}=s,A_{t}=a]=\sum_{s'}{p(s'|s,a)v_{*}(s')}=\sum_{r,s'}{p(r,s'|s,a)v_{*}(s')}$ since $\sum_{r}{p(r,s'|s,a)}=1$, or am I missing something? – Daviiid Jun 14 '21 at 21:01
  • @Daviiid I cannot see any difference between what you have written and what I have done in the answer, in terms of needing to interpret $v_*(S_{t+1})$. You have just resolved the expectation to the sum more conventionally (the double-variable sum $\sum_{r,s'}$ is a different convention I have borrowed from Sutton & Barto). But I don't think it uses a different interpretation of what $v_*(S_{t+1})$ is? Sorry, I am not a notation expert, and may have missed some subtlety. – Neil Slater Jun 14 '21 at 21:06
  • I'd like to apologize first for not being able to add your name with the @. There is no difference between what I've written and your answer; I just wanted to rewrite it as a clarification of my question. I also wanted to know whether there is a particular reason for the authors to choose this convention for functions of random variables, because normally a function of a random variable changes the probability distribution: just imagine scalar multiplication of a Gaussian random variable and how its distribution changes. – Daviiid Jun 14 '21 at 21:43
  • @Daviiid: OK, I think I understand the concern. With discrete probability distributions then the problems with the convention are minor and do not impact any of the sums. For continuous distributions you have to be more careful, although I think most of the manipulations of the expectations etc in RL would still be fine. – Neil Slater Jun 15 '21 at 06:50
  • I understand now. Thank you for the explanations and all your clarifications. – Daviiid Jun 15 '21 at 07:58