
In Sutton and Barto's book, on page 63 (page 81 of the PDF), the following equality appears: $$\mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s,A_t=\pi'(s)] = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_{t} = s]$$

Why does $\mathbb{E}$ suddenly change to $\mathbb{E}_{\pi'}$, and why does the $A_t = \pi'(s)$ condition disappear?

Also, in general, with respect to which distribution is a conditional expectation computed? From what I have seen, in $\mathbb{E}[X \mid Y]$ we always take the expected value over the distribution of $X$.

nbro
ZERO NULLS

1 Answer


> Also, in general, with respect to which distribution is a conditional expectation computed? From what I have seen, in $\mathbb{E}[X|Y]$ we always take the expected value over the distribution of $X$.

No. For $\mathbb{E}[X|Y]$ we take the expectation of $X$ with respect to the conditional distribution of $X$ given $Y$, i.e. for each value $y$ of $Y$,

$$\mathbb{E}[X|Y=y] = \int_\mathbb{R} x \, p(x|y) \, dx\;,$$

where $p(x|y)$ is the density function of the conditional distribution. If your random variables are discrete, replace the integral with a summation and the density with a probability mass function. Also note that $\mathbb{E}[X|Y]$ is itself a random variable: it is a function of $Y$ that takes the value $\mathbb{E}[X|Y=y]$ when $Y=y$.
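For the discrete case, the definition can be illustrated with a minimal Python sketch. The joint probability table `joint` and the helper names are made up for illustration. The point is that the weights are $p(x|y)$, not the marginal $p(x)$, so the result changes with $y$.

```python
# Hypothetical joint pmf p(x, y) with X in {0, 1, 2} and Y in {0, 1}.
# The numbers are arbitrary; they only need to sum to 1.
joint = {
    (0, 0): 0.10, (1, 0): 0.20, (2, 0): 0.10,
    (0, 1): 0.30, (1, 1): 0.10, (2, 1): 0.20,
}


def conditional_expectation(joint, y):
    """Return E[X | Y = y] = sum_x x * p(x | y) for a discrete joint pmf."""
    # Marginal p(y), needed to turn p(x, y) into p(x | y) = p(x, y) / p(y).
    p_y = sum(p for (_, y_val), p in joint.items() if y_val == y)
    return sum(x * p / p_y for (x, y_val), p in joint.items() if y_val == y)


def marginal_expectation(joint):
    """Return E[X] = sum_x x * p(x), using the marginal distribution of X."""
    return sum(x * p for (x, _), p in joint.items())


for y in (0, 1):
    print(f"E[X | Y = {y}] =", conditional_expectation(joint, y))
print("E[X] =", marginal_expectation(joint))
```

The two conditional values differ from each other and from $\mathbb{E}[X]$, which is why $\mathbb{E}[X|Y]$ is a function of $Y$ rather than a single number.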

> Why does $\mathbb{E}$ suddenly change to $\mathbb{E}_{\pi '}$, and why does the $A_t = \pi '(s)$ condition disappear?

This is because in this instance $\pi'$ is a deterministic policy: in state $s$ it takes the single action $\pi'(s)$ with probability 1 and every other action with probability 0. NB: writing the policy as a function of the state alone, $\pi'(s)$, is the convention used in Sutton and Barto to denote a deterministic policy.

Let $b = \pi'(s)$ denote the action that the policy takes in state $s$. The first expectation is then

$$\mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t = \pi'(s) = b] = \sum_{s',r}p(s',r|s,a=b)(r + \gamma v_\pi(s'))\;,$$

and the second expectation is

$$\mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s] = \sum_a\pi'(a|s)\sum_{s',r}p(s',r|s,a)(r + \gamma v_\pi(s'))\;.$$

However, we know that $\pi'(a|s) = 0 \; \forall a \neq b$, so every term of the sum over $a$ vanishes except the one with $a=b$, for which $\pi'(b|s) = 1$. The expectation therefore becomes

$$\mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s] = \sum_{s',r}p(s',r|s,a=b)(r + \gamma v_\pi(s'))\;,$$

which is exactly the expression for the first expectation, and so the two expectations are equal.
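The argument can also be checked numerically. The following is a minimal Python sketch on a made-up two-state, two-action MDP: the dynamics `dynamics`, the values `v` (an arbitrary stand-in for $v_\pi$), the discount `gamma`, and the policy `pi_prime` are all hypothetical and chosen only for illustration.

```python
# Hypothetical MDP: dynamics[(s, a)] lists (probability, next_state, reward)
# triples, i.e. the entries of p(s', r | s, a).
dynamics = {
    (0, 0): [(0.7, 0, 1.0), (0.3, 1, 0.0)],
    (0, 1): [(0.4, 0, 2.0), (0.6, 1, -1.0)],
    (1, 0): [(1.0, 1, 0.5)],
    (1, 1): [(0.5, 0, 0.0), (0.5, 1, 3.0)],
}
actions = (0, 1)
gamma = 0.9
v = {0: 1.5, 1: -0.5}    # arbitrary stand-in for v_pi
pi_prime = {0: 1, 1: 0}  # deterministic policy: state -> action


def q_backup(s, a):
    """sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))."""
    return sum(p * (r + gamma * v[s_next]) for p, s_next, r in dynamics[(s, a)])


def policy_prob(a, s):
    """pi'(a | s) for the deterministic policy: 1 for a = pi'(s), else 0."""
    return 1.0 if a == pi_prime[s] else 0.0


def expectation_given_action(s):
    """E[R + gamma * v(S') | S_t = s, A_t = pi'(s)]."""
    return q_backup(s, pi_prime[s])


def expectation_under_policy(s):
    """E_{pi'}[R + gamma * v(S') | S_t = s], averaging over a ~ pi'(. | s)."""
    return sum(policy_prob(a, s) * q_backup(s, a) for a in actions)


for s in (0, 1):
    print(s, expectation_given_action(s), expectation_under_policy(s))
```

For each state the two printed values agree, because the sum over actions has a single nonzero term. With a stochastic $\pi'$ the second function would still be well defined, but conditioning on one specific action would no longer give the same number in general.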

David