11

In Section 7.1 (on n-step bootstrapping) of the book Reinforcement Learning: An Introduction (2nd edition), by Richard S. Sutton and Andrew G. Barto, the authors state what they call the error reduction property of n-step returns:

$$\max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ G_{t:t+n} \mid S_t = s \right] - v_{\pi}(s) \Bigr\rvert \leq \gamma^n \max_s \Bigl\lvert V_{t+n-1}(s) - v_{\pi}(s) \Bigr\rvert \tag{7.3}$$

But they don't prove it. I was thinking it should not be too hard, but how can we show this? I was thinking of using the definition of the n-step return (Eq. 7.1 on the previous page):

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n})$$

Because then this already contains $V_{t+n-1}$. But the definition of the n-step return uses $V_{t+n-1}(S_{t+n})$, whereas the right-hand side of inequality (7.3) that we want to prove uses $V_{t+n-1}(s)$ with a lowercase $s$, so I'm confused about which state $s$ is being used there. And after that, I guess we probably pull out a $\gamma^{n}$ term or something, but how should we proceed from there?
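To make the two states concrete to myself, here is a minimal numerical sketch of Eq. 7.1 (plain Python, all numbers made up): $S_{t+n}$ is whatever state a particular trajectory happens to reach after $n$ steps, while the $s$ on the right-hand side of (7.3) seems to be a free variable that the outer $\max_s$ ranges over.

```python
# Minimal sketch of the n-step return G_{t:t+n} from Eq. 7.1 (made-up numbers).
# rewards holds R_{t+1}, ..., R_{t+n} from one sampled trajectory; V is the
# current estimate table V_{t+n-1}, queried at the state reached at time t+n.
gamma = 0.9
n = 3
rewards = [1.0, 0.0, 2.0]                 # R_{t+1}, R_{t+2}, R_{t+3}
V = {"A": 0.5, "B": -0.2, "C": 1.3}       # V_{t+n-1}(.)
state_at_t_plus_n = "B"                   # S_{t+n}: state actually reached after n steps

# G_{t:t+n} = R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n} + gamma^n V_{t+n-1}(S_{t+n})
G = sum(gamma**k * r for k, r in enumerate(rewards)) + gamma**n * V[state_at_t_plus_n]
print(G)

# The right-hand side of (7.3) instead sweeps a free state s over the whole state
# space: gamma^n * max_s |V_{t+n-1}(s) - v_pi(s)|. Nothing ties that s to S_{t+n}.
```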

This is the newest Sutton & Barto book (book page 144, Equation 7.3): https://drive.google.com/file/d/1opPSz5AZ_kVa1uWOdOiveNiBFiEOHjkG/view

nbro
123learn

2 Answers

10

Let's start by looking at:

$$\max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ G_{t:t+n} \mid S_t = s \right] - v_{\pi}(s) \Bigr\rvert.$$

We can rewrite this by plugging in the definition of $G_{t:t+n}$:

$$\begin{aligned}
& \max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ G_{t:t+n} \mid S_t = s \right] - v_{\pi}(s) \Bigr\rvert \\
=& \max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ R_{t + 1} + \gamma R_{t + 2} + \dots + \gamma^{n - 1} R_{t + n} + \gamma^n V_{t + n - 1}(S_{t + n}) \mid S_t = s \right] - v_{\pi}(s) \Bigr\rvert \\
=& \max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ R_{t:t+n} + \gamma^n V_{t + n - 1}(S_{t + n}) \mid S_t = s \right] - v_{\pi}(s) \Bigr\rvert,
\end{aligned}$$

where $R_{t:t+n} \doteq R_{t + 1} + \gamma R_{t + 2} + \dots + \gamma^{n - 1} R_{t + n}$.

If you go all the way back to page 58 of the book, you can see the definition of $v_{\pi}(s)$:

$$\begin{aligned}
v_{\pi}(s) &\doteq \mathbb{E}_{\pi} \left[ \sum_{k = 0}^{\infty} \gamma^k R_{t + k + 1} \mid S_t = s \right] \\
&= \mathbb{E}_{\pi} \left[ R_{t:t+n} + \gamma^n \sum_{k = 0}^{\infty} \gamma^k R_{t + n + k + 1} \mid S_t = s \right] \\
&= \mathbb{E}_{\pi} \left[ R_{t:t+n} \mid S_t = s \right] + \gamma^n \mathbb{E}_{\pi} \left[ \sum_{k = 0}^{\infty} \gamma^k R_{t + n + k + 1} \mid S_t = s \right]
\end{aligned}$$
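As a quick sanity check of this decomposition, here is a minimal sketch in Python/NumPy with a made-up 3-state Markov reward process (the policy already folded into the transition matrix $P$ and expected reward vector $r$), verifying numerically that $v_{\pi} = \sum_{k=0}^{n-1} \gamma^k P^k r + \gamma^n P^n v_{\pi}$:

```python
import numpy as np

# Made-up 3-state Markov reward process (policy folded into P and r).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, -0.5])
gamma, n = 0.9, 4

# Exact v_pi from the Bellman equation: (I - gamma P) v_pi = r.
v_pi = np.linalg.solve(np.eye(3) - gamma * P, r)

# n-step decomposition: v_pi = sum_{k=0}^{n-1} gamma^k P^k r + gamma^n P^n v_pi.
rhs = sum(gamma**k * np.linalg.matrix_power(P, k) @ r for k in range(n)) \
      + gamma**n * np.linalg.matrix_power(P, n) @ v_pi
print(np.allclose(v_pi, rhs))  # True
```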

Using this, we can continue rewriting where we left off above:

$$\begin{aligned}
& \max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ R_{t:t+n} + \gamma^n V_{t + n - 1}(S_{t + n}) \mid S_t = s \right] - v_{\pi}(s) \Bigr\rvert \\
=& \max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ \gamma^n V_{t + n - 1}(S_{t + n}) \mid S_t = s \right] - \gamma^n \mathbb{E}_{\pi} \left[ \sum_{k = 0}^{\infty} \gamma^k R_{t + n + k + 1} \mid S_t = s \right] \Bigr\rvert \\
=& \gamma^n \max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ V_{t + n - 1}(S_{t + n}) - \sum_{k = 0}^{\infty} \gamma^k R_{t + n + k + 1} \mid S_t = s \right] \Bigr\rvert
\end{aligned}$$

Because the absolute value function is convex, we can use Jensen's inequality to show that the absolute value of an expectation is less than or equal to the expectation of the corresponding absolute value:

$$\left| \mathbb{E} \left[ X \right] \right| \leq \mathbb{E} \left[ \left| X \right| \right].$$
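This can be sanity-checked numerically with any discrete random variable (a minimal sketch, with made-up values and probabilities):

```python
# Numerical check of |E[X]| <= E[|X|] for an arbitrary discrete random variable X.
values = [-3.0, -0.5, 1.0, 2.5]
probs  = [0.1, 0.4, 0.3, 0.2]    # must sum to 1

E_X     = sum(p * x for p, x in zip(probs, values))        # E[X]   = 0.3
E_abs_X = sum(p * abs(x) for p, x in zip(probs, values))   # E[|X|] = 1.3

print(abs(E_X), E_abs_X)
assert abs(E_X) <= E_abs_X
```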

This means that:

$$\max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ G_{t:t+n} - v_{\pi}(s) \mid S_t = s \right] \Bigr\rvert \leq \gamma^n \max_s \mathbb{E}_{\pi} \left[ \Bigl\lvert V_{t + n - 1}(S_{t + n}) - \sum_{k = 0}^{\infty} \gamma^k R_{t + n + k + 1} \Bigr\rvert \mid S_t = s \right]$$

Now, the important trick here is to see that:

$$\max_s \mathbb{E}_{\pi} \left[ \Bigl\lvert V_{t + n - 1}(S_{t + n}) - \sum_{k = 0}^{\infty} \gamma^k R_{t + n + k + 1} \Bigr\rvert \mid S_t = s \right] \leq \max_s \mathbb{E}_{\pi} \left[ \Bigl\lvert V_{t + n - 1}(S_{t}) - \sum_{k = 0}^{\infty} \gamma^k R_{t + k + 1} \Bigr\rvert \mid S_t = s \right]$$

I'm skipping the formal steps to show that this is the case to save space, but the intuition is that:

  • The left-hand side of this inequality involves finding an $S_t = s$ such that some function of $S_{t + n}$ is maximized, whereas the right-hand side involves finding an $S_t = s$ such that exactly the same function of $S_{t}$ is maximized.
  • In the left-hand side, selecting an $S_t = s$ only implicitly induces a probability distribution over the possible states $S_{t + n}$, determined by $S_t$, the environment's transition dynamics, and the policy $\pi$. Intuitively, this is more "restrictive" for the $\max$ operator: it does not have the "freedom" to directly select a single state $S_{t + n}$ that maximizes the function. The right-hand side is free to choose an $S_t = s$ equal to whatever "optimal" $S_{t+n}$ the left-hand side would like to reach, and it is also free to make even better choices that might never be reliably reachable after $n$ steps on the left-hand side (see the numerical sketch after this list).
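Here is a minimal numerical sketch of that second bullet (a made-up two-state example, where $f$ stands for the function inside the expectation and the reward terms are ignored for simplicity): averaging $f$ over the distribution of $S_{t+n}$ induced by each starting state can never exceed simply picking the best state directly, i.e. $\max_s \mathbb{E}\left[f(S_{t+n}) \mid S_t = s\right] \leq \max_s f(s)$.

```python
# Made-up example: f(s) is the function inside the expectation, P[s] is the
# distribution over S_{t+n} induced by starting in S_t = s (n-step transition
# probabilities under the policy and environment dynamics).
f = {"A": 4.0, "B": 1.0}
P = {
    "A": {"A": 0.3, "B": 0.7},   # from A, S_{t+n} = A w.p. 0.3 and B w.p. 0.7
    "B": {"A": 0.6, "B": 0.4},
}

lhs = max(sum(p * f[s2] for s2, p in P[s].items()) for s in P)  # max_s E[f(S_{t+n}) | S_t = s]
rhs = max(f.values())                                           # max_s f(s)
print(lhs, rhs)   # 2.8 4.0
assert lhs <= rhs
```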

We can use this to rewrite the previous inequality (possibly making the right-hand side a bit bigger than it was, but that's fine: it was already an upper bound, so the inequality still holds):

$$\begin{aligned}
\max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ G_{t:t+n} - v_{\pi}(s) \mid S_t = s \right] \Bigr\rvert &\leq \gamma^n \max_s \mathbb{E}_{\pi} \left[ \Bigl\lvert V_{t + n - 1}(S_{t}) - \sum_{k = 0}^{\infty} \gamma^k R_{t + k + 1} \Bigr\rvert \mid S_t = s \right] \\
&= \gamma^n \max_s \mathbb{E}_{\pi} \left[ \Bigl\lvert V_{t + n - 1}(S_{t}) - v_{\pi}(s) \Bigr\rvert \mid S_t = s \right].
\end{aligned}$$

After this rewriting we've got a hidden $\mathbb{E}_{\pi}$ "inside" another $\mathbb{E}_{\pi}$ (because the definition of $v_{\pi}(s)$ itself contains an $\mathbb{E}_{\pi}$), which I suppose is kind of ugly... but mathematically harmless.

The maximum of a random variable is an upper bound on its expectation, so we can get rid of the expectation on the right-hand side (again potentially increasing the right-hand side, which is still fine since it was already an upper bound anyway):

$$\max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ G_{t:t+n} - v_{\pi}(s) \mid S_t = s \right] \Bigr\rvert \leq \gamma^n \max_s \Bigl\lvert V_{t + n - 1}(s) - v_{\pi}(s) \Bigr\rvert,$$

which we can finally rewrite into Equation (7.3) of the book by moving the subtraction of $v_{\pi}(s)$ outside of the expectation on the left-hand side (which is fine because, given $S_t = s$, $v_{\pi}(s)$ is just a constant and can be pulled out of the expectation).
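Finally, inequality (7.3) itself can be checked numerically. A minimal sketch with a made-up 3-state Markov reward process (policy folded into $P$ and $r$), an exact $v_{\pi}$, and a deliberately wrong estimate $V_{t+n-1}$:

```python
import numpy as np

# Made-up 3-state Markov reward process (policy folded into P and r).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, -0.5])
gamma, n = 0.9, 4

v_pi = np.linalg.solve(np.eye(3) - gamma * P, r)   # exact v_pi
V = np.array([3.0, -1.0, 0.7])                     # arbitrary (wrong) estimate V_{t+n-1}

# E[G_{t:t+n} | S_t = s] = sum_{k=0}^{n-1} gamma^k (P^k r)(s) + gamma^n (P^n V)(s)
EG = np.zeros(3)
Pk = np.eye(3)
for k in range(n):
    EG += gamma**k * (Pk @ r)
    Pk = Pk @ P
EG += gamma**n * (Pk @ V)

lhs = np.max(np.abs(EG - v_pi))             # worst error of the expected n-step return
rhs = gamma**n * np.max(np.abs(V - v_pi))   # gamma^n times the worst error of V_{t+n-1}
print(lhs, rhs)
assert lhs <= rhs + 1e-12                   # inequality (7.3)
```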

Philip Raeisghasem
Dennis Soemers
  • Your bullet points explaining the "trick" make intuitive sense. Could you provide a link to someone showing the formal steps? – Philip Raeisghasem Mar 18 '19 at 09:13
  • @PhilipRaeisghasem I don't know of any such links by heart, so can't provide one, sorry :( – Dennis Soemers Mar 18 '19 at 19:18
  • I'd argue that, if it's that hard to find this information, then including it in the post would greatly increase the post's value. Or maybe as a separate, linked question? – Philip Raeisghasem Mar 18 '19 at 19:50
  • @PhilipRaeisghasem Absolutely, either would be fine... maybe separate question better, since this one's already fairly long. I'm currently working towards a paper submission deadline though, so I really don't personally feel like working out the math and writing it all down any time soon. – Dennis Soemers Mar 18 '19 at 20:23
  • Can we just say that the set of possible states $S_{t+n}$ is a subset of the set of possible states $S_t = s$, and therefore the max over $S_{t+n}$ must be <= the max over $s$? – Phizaz Aug 29 '19 at 07:50
  • 1
  • @Phizaz Hmmm not exactly. Close to that, I'd say "given a completely free, unrestricted choice for $S_t$, the set of possible states $S_{t+n}$ is a subset of the set of possible states $S_t$". Very often it will not be a proper subset though; very often they'll be equal. But even then, a very important other point is that the **probability distributions** over those sets will be different. When the choice of $S_t$ is unrestricted, you can assign full probability mass to whichever state is "best". Even if that best state may still be possible for $S_{t+n}$, it may no longer have full probability mass. – Dennis Soemers Aug 29 '19 at 08:15
  • 1
  • @DennisSoemers I agree. However, would we not be able to prove it without the probability mass argument? – Phizaz Aug 30 '19 at 12:59
  • @Phizaz Ah yes I think you're completely right. Intuitively it didn't "feel" right to me to omit that... but I guess math doesn't care about my feelings :) – Dennis Soemers Aug 30 '19 at 13:24
  • How can we intuitively understand the error reduction property? And how does it guarantee convergence? – FantasticAI Oct 30 '20 at 01:20
1

I will try to answer the question in a less mathematical (and hopefully correct) way.

NOTE: I have used $V_{\pi}$ and $v_{\pi}$ interchangeably.

We start from the LHS:

$$\max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ G_{t:t+n} \mid S_t = s \right] - v_{\pi}(s) \Bigr\rvert$$

This can be written in terms of trajectories. Say the probability of observing the $j$-th $n$-step trajectory (for an $n$-step return) starting from state $S_t = s$ is $p_j^{s}$. Then we can write the expected return as the sum, over all trajectories, of each trajectory's return multiplied by the probability of that trajectory:

$$\mathbb{E}_{\pi} [G_{t:t+n} \mid S_t = s] = \sum_j p_j^s G_{t:t+n}^j = \sum_j p_j^s \left[ R_{t+1}^j + \gamma R_{t+2}^j + \dots + \gamma^{n-1} R_{t+n}^j + \gamma^n V_{t+n-1}^j(S_{t + n}) \right]$$

We use @Dennis's notation for the $n$-step reward sum, i.e.

$R_{t:t+n}^j \doteq R_{t + 1}^j + \gamma R_{t + 2}^j + \dots + \gamma^{n - 1} R_{t + n}^j$.
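As a minimal sketch of this trajectory-sum form (a tiny made-up two-state example, where for brevity the reward depends only on the state being left), we can enumerate every $n$-step trajectory $j$, accumulate its probability $p_j^s$, and weight its return accordingly:

```python
from itertools import product

# Tiny made-up example: E[G_{t:t+n} | S_t = s] = sum_j p_j^s * G_{t:t+n}^j,
# enumerating every length-n trajectory j that starts in s.
P = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.4, "B": 0.6}}   # transition probabilities
r = {"A": 1.0, "B": 0.0}                                      # reward for leaving a state
V = {"A": 0.3, "B": -0.1}                                     # current estimates V_{t+n-1}
gamma, n, s = 0.9, 2, "A"

EG = 0.0
for traj in product(P.keys(), repeat=n):     # all sequences (S_{t+1}, ..., S_{t+n})
    prob, prev, G = 1.0, s, 0.0
    for k, s_k in enumerate(traj):
        prob *= P[prev][s_k]                 # p_j^s, accumulated step by step
        G += gamma**k * r[prev]              # gamma^k * R_{t+k+1}^j
        prev = s_k
    G += gamma**n * V[prev]                  # bootstrap term gamma^n * V_{t+n-1}(S_{t+n})
    EG += prob * G
print(EG)                                    # E_pi[G_{t:t+n} | S_t = s]
```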

Now we know that $v_{\pi}(s)$ is nothing but $\mathbb{E}_{\pi} [G_{t:t+n}^{\pi} \mid S_t = s]$, where I have used $G_{t:t+n}^{\pi}$ to denote the return we would obtain if the policy had been evaluated completely (i.e. the value function is consistent with the policy for every state, perhaps using infinitely many episodes), i.e. $G_{t:t+n}^{\pi} = R_{t:t+n} + \gamma^n V_{\pi}(S_{t + n})$.

So now, if we evaluate this expression:

$$\max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ G_{t:t+n} \mid S_t = s \right] - v_{\pi}(s) \Bigr\rvert = \max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ G_{t:t+n} \mid S_t = s \right] - \mathbb{E}_{\pi} [G_{t:t+n}^{\pi}|S_t = s]\Bigr\rvert$$

which can further be written in terms of trajectory probabilities (for easier comprehension):

$$\max_s \Bigl\lvert \sum_j p_j^s(R_{t:t+n}^j+\gamma^n V_{t+n-1}^j(S_{t + n})) - \sum_j p_j^s(R_{t:t+n}^j+\gamma^n V_{\pi}^j(S_{t + n})) \Bigr\rvert$$

Now the expression can be simplified by cancelling the reward terms ($R_{t:t+n}^j$), since they are the same within a given trajectory and thus common to both terms. We get: $$\max_s \Bigl\lvert \gamma^n\sum_j p_j^s( V_{t+n-1}^j(S_{t + n})-V_{\pi}^j(S_{t + n}))\Bigr\rvert$$

This is basically the expected deviation of the state reached at the $n$-th step from its actual value $V_{\pi}$, starting from $S_t = s$ and multiplied by the discount factor $\gamma^n$.

Now, using the identity $\mathbb{E}[X] \leq \max X$ (applied to the absolute deviations), we get:

$$\max_s \Bigl\lvert \gamma^n\sum_j p_j^s( V_{t+n-1}^j(S_{t + n})-V_{\pi}^j(S_{t + n}))\Bigr\rvert \leq \max_j \Bigl\lvert \gamma^n( V_{t+n-1}^j(S_{t + n})-V_{\pi}^j(S_{t + n}))\Bigr\rvert$$

This can finally be written as: $$\max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ G_{t:t+n} \mid S_t = s \right] - v_{\pi}(s) \Bigr\rvert \leq \max_j \Bigl\lvert \gamma^n( V_{t+n-1}^j(S_{t + n})-V_{\pi}^j(S_{t + n}))\Bigr\rvert$$

Now, the maximum on the RHS runs only over those states reachable from $S_t = s$ via a trajectory, but since maximizing over a larger set can only increase the bound, we can include the whole state space in the $\max$ operation without any problem, to finally write:

$$\max_s \Bigl\lvert \mathbb{E}_{\pi} \left[ G_{t:t+n} \mid S_t = s \right] - v_{\pi}(s) \Bigr\rvert \leq \max_s \Bigl\lvert \gamma^n( V_{t+n-1}(s)-V_{\pi}(s))\Bigr\rvert$$

which concludes the proof.