4

In equation 3.17 of Sutton and Barto's book:

$$q_*(s, a)=\mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$$

$G_{t+1}$ here have been replaced with $v_*(S_{t+1})$, but no reason has been provided for why this step has been taken.

Can someone provide the reasoning behind why $G_{t+1}$ is equal to $v_*(S_{t+1})$?

nbro
  • 39,006
  • 12
  • 98
  • 176
ZERO NULLS
  • 147
  • 8

2 Answers2

3

Can someone provide the reasoning behind why $G_{t+1}$ is equal to $v_*(S_{t+1})$?

The two things are not usually exactly equal, because $G_{t+1}$ is a probability distribution over all possible future returns whilst $v_*(S_{t+1})$ is a probability distribution derived over all possible values of $S_{t+1}$. These will be different distributions much of the time, but their expectations are equal, provided the conditions of the expectation match.

In other words,

$$G_{t+1} \neq v_*(S_{t+1})$$

But

$$\mathbb{E}[G_{t+1}] = \mathbb{E}[v_*(S_{t+1})]$$

. . . when the conditions that apply to the expectations on each side are compatible. The relevant conditions are

  • Same initial state or state/action at given timestep $t$ (or you could pick any earlier timestep)

  • Same state progression rules and reward structure (i.e. same MDP)

  • Same policy

More details

The definition of $v(s)$ can be given as

$$v(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

If you substitute step s' and index $t+1$ you get

$$v(s') = \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']$$

(This is the same equation, true by definition, the substitution just shows you how it fits).

In order to put this into equation 3.17, you need to note that:

  • It is OK to substitute terms inside an expectation if they are equal in separate expections, amd the conditions $c$ and $Y$ apply to both (or are irrelevant to either one or both). So if for example $\mathbb{E}_c[Z] = \mathbb{E}_c[X \mid Y]$ where $X$ and $Z$ are random variables, and you know $Z$ is independent of $Y$ then you can say $\mathbb{E}_c[W + 2X \mid Y] = \mathbb{E}_c[W + 2Z \mid Y]$ even if $X$ and $Z$ are different distributions.

  • $A_{t+1} = a'$ does not need to be specified because it is decided by the same $\pi$ in both $q(s,a)$ and $v(s')$, making the conditions on the expectation compatible already. So the condition of following $\pi$ is compatible with $\mathbb{E}_\pi[G_{t+1} \mid S_{t} = s, A_{t}=a] = \mathbb{E}_\pi[v_*(S_{t+1}) \mid S_{t} = s, A_{t}=a]$

  • The expectation over possible $s'$ in $\mathbb{E}_\pi[v_*(S_{t+1})|S_t=s, A_t=a] = \sum p(s'|s,a)v_*(s')$ is already implied by conditions on the original expectation that the functions are evaluating the same environment - something that is not usually shown in the notation.

Also worth noting, in 3.17 $\pi$ is the optimal policy $\pi^*$, but actually the equation holds for any fixed policy.

Neil Slater
  • 28,678
  • 3
  • 38
  • 60
2

Note that for a general policy $\pi$ we have that $q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]$, where in state $S_t$ we take action $a$ and thereafter following policy $\pi$. Note that the expectation is taken with respect to the reward transition distribution $\mathbb{P}(R_{t+1} = r, S_{t+1} = s' | A_t = a, S_t = s)$ which I will denote as $p(s',r,|s,a)$.

We can then rewrite the expectation as follows

\begin{align} q_{\pi}(s,a) &= \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a] \\ & = \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \\ & = \sum_{r,s'}p(s',r|s,a)(r + \gamma \mathbb{E}_\pi[G_{t+1} | S_{t+1} = s']) \\ & = \sum_{r,s'}p(s',r|s,a)(r + \gamma v_{\pi}(s')) \; . \end{align}

The key thing to note is that these two terms, $G_{t+1}$ and $v_{\pi}(s')$, are only equal in expectation, which is why in the equation you can exchange the terms because we are taking the expectation.

Note that I have shown this for a general policy $\pi$ not just the optimal policy.

David
  • 4,591
  • 1
  • 6
  • 25