
I do not understand how the equation below is obtained. That is, how is the first line obtained, and where is the information $v_{\pi}(x_k)=v_{\pi'}(x_k)$ used in the derivation involving the greedy policy?


\begin{align} v_{\pi'}(x_{k}) &= \max_{u_{k} \in U} \mathbb{E} [R_{k+1} + \gamma v_{\pi'}(X_{k+1})\,|\,X_{k}=x_{k}, U_{k}=u_{k}] \\ &= \max_{u_{k}} R^{u}_{x} + \gamma \sum_{x_{k+1} \in X} p^{u}_{xx'} v_{\pi'}(x_{k+1}) \end{align}

where $\pi^{\prime}(x_k)=\underset{u_k \in \mathcal{U}}{\operatorname{argmax}}\, q_{\pi}(x_k, u_k)$.


Note: the source of the above equation is slide 22 at the following link: https://groups.uni-paderborn.de/lea/share/lehre/reinforcementlearning/lecture_slides/built/Lecture03.pdf

  • Can you please put your **specific question** in the title? You can also use mathjax in the title – nbro Feb 03 '23 at 11:21
  • I did. I will be happy if you can provide an answer to this question as a top expert in RL. – DSPinfinity Feb 04 '23 at 13:06
  • I don't know why they didn't use the more common notation $s$ and $a$ to denote states and actions, respectively. Anyway, I'll check your question later (if I don't forget). – nbro Feb 04 '23 at 17:33

1 Answer


Policy iteration improves the policy step by step, so that $v_{\pi}(x_{k}) \leq v_{\pi'}(x_{k})$ for all $x_{k} \in X$. If $v_{\pi}(x_{k}) = v_{\pi'}(x_{k})$ for all $x_{k} \in X$, we do not need to iterate any further: the actions chosen by $\pi$ and $\pi'$ are of the same quality, so we have already found the optimal policy.

As you mentioned,
\begin{equation} \pi'(x_{k}) = \underset{u_{k}\in U}{\mathrm{argmax}}~q_{\pi}(x_{k}, u_{k}), \end{equation}
and, since we are using a greedy policy (see Sutton & Barto, eq. 3.18, for the last equality sign, and this answer for a step-by-step derivation),
\begin{equation} v_{\pi'}(x_{k}) = v_{\pi}(x_{k}) = \max_{u_{k}\in U} q_{\pi}(x_{k}, u_{k}). \end{equation}
From here it follows that
\begin{align} v_{\pi'}(x_{k}) = \max_{u_{k}\in U} q_{\pi}(x_{k}, u_{k}) = \max_{u_{k}} \mathbb{E} [R_{k+1} + \gamma v_{\pi}(X_{k+1})\,|\,X_{k} = x_{k}, U_{k} = u_{k}], \end{align}
and since $v_{\pi}(x_{k}) = v_{\pi'}(x_{k})$ holds for all $x_{k} \in X$, we can do the final step:
\begin{align} v_{\pi'}(x_{k}) = \max_{u_{k}\in U} q_{\pi}(x_{k}, u_{k}) = \max_{u_{k}} \mathbb{E} [R_{k+1} + \gamma v_{\pi'}(X_{k+1})\,|\,X_{k} = x_{k}, U_{k} = u_{k}]. \end{align}
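
To see how the stopping condition and the resulting Bellman optimality equation fit together, here is a minimal tabular policy-iteration sketch in Python. The random MDP, the names `P`, `R` and the helper `evaluate` are my own illustration and are not taken from the slides:

```python
import numpy as np

# Purely illustrative MDP (not from the slides): transition tensor P[u, x, x']
# and expected-reward matrix R[x, u] for a small tabular problem.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # each row is a probability distribution
R = rng.random((n_states, n_actions))

def evaluate(policy):
    """Solve v = R_pi + gamma * P_pi v exactly for a deterministic policy."""
    P_pi = np.stack([P[policy[x], x] for x in range(n_states)])
    R_pi = R[np.arange(n_states), policy]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

policy = np.zeros(n_states, dtype=int)
while True:
    v = evaluate(policy)                              # v_pi
    q = R + gamma * np.einsum("uxy,y->xu", P, v)      # q_pi(x_k, u_k)
    new_policy = q.argmax(axis=1)                     # greedy policy pi'
    if np.allclose(evaluate(new_policy), v):          # v_pi'(x_k) == v_pi(x_k) for all x_k
        break                                         # no further improvement possible
    policy = new_policy

# At this point v satisfies v(x_k) = max_u q_pi(x_k, u), i.e. the Bellman
# optimality equation from the last line above holds.
assert np.allclose(v, q.max(axis=1))
```

The loop stops exactly when the greedy policy no longer changes the value function, and that is the moment at which the substitution $v_{\pi}(x_{k}) = v_{\pi'}(x_{k})$ in the final step becomes valid.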

Edit: The following is the answer I posted to an older version of the question, where it was not clear that the OP wanted to know how the first line is derived; I had assumed instead that the simplification between the first and second line was what needed explaining.
The second line in your equation is just the matrix notation (see slide 9 in your document) of the first one and shows how to build the expectation value:
\begin{align} v_{\pi'}(x_{k}) &= \max\limits_{u_{k} \in U} \mathbb{E} [R_{k+1} + \gamma v_{\pi'}(X_{k+1})\,|\,X_{k}=x_{k}, U_{k}=u_{k}] \\ &= \max\limits_{u_{k}} R^{u}_{x} + \gamma \mathbb{E} [v_{\pi'}(X_{k+1})\,|\,X_{k}=x_{k}, U_{k}=u_{k}] \\ &= \max\limits_{u_{k}} R^{u}_{x} + \gamma \sum_{x_{k+1} \in X}p(x_{k+1} | x_{k}, u_{k})\, v_{\pi'}(x_{k+1}) \\ &=\max\limits_{u_{k}} R^{u}_{x} + \gamma \sum_{x_{k+1} \in X}p^{u}_{xx'}\, v_{\pi'}(x_{k+1}) \end{align}

In the first step, we take the expectation over the reward and move $\gamma$ out of the expectation value. In the second step, we write out the remaining expectation explicitly, i.e., we weight the value function of each possible successor state $x_{k+1}$ by the probability of ending up there. Finally, we change the notation to match the standard notation provided by the OP.
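
As a quick numerical sanity check of this weighted sum, here is a tiny sketch; the transition probabilities and values below are invented purely for illustration:

```python
import numpy as np

# Toy numbers (not from the slides): from state x_k, action u_k yields an
# expected immediate reward R^u_x and three possible successor states.
gamma = 0.9
R_x_u = 1.0                               # R^u_x, expected immediate reward
p_next = np.array([0.2, 0.5, 0.3])        # p(x_{k+1} | x_k, u_k)
v_next = np.array([4.0, 2.0, 0.0])        # v_pi'(x_{k+1}) for each successor

# E[R_{k+1} + gamma * v_pi'(X_{k+1}) | x_k, u_k]
#   = R^u_x + gamma * sum_{x'} p(x' | x_k, u_k) * v_pi'(x')
expectation = R_x_u + gamma * np.dot(p_next, v_next)
print(expectation)   # 1.0 + 0.9 * (0.2*4 + 0.5*2 + 0.3*0) = 2.62
```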

  • I updated the question with the points that are not clear to me. Could you please update your answer as well? – DSPinfinity Feb 03 '23 at 05:34
  • @DSPinfinity Thanks for the clarification, I updated the answer – pythonic833 Feb 04 '23 at 23:03
  • Thank you for your efforts, but I think the answer is still not correct. In your answer you say "since we have found the optimal policy", but that is what you want to show! – DSPinfinity Feb 05 '23 at 10:33
  • So first, if you are using a procedure that leads to the optimal policy and your policy converges, i.e. $v_{\pi}(x) = v_{\pi'}(x)$ for all $x$, then you have found the optimal policy. Second, this equation holds even for a non-optimal but greedy, deterministic policy. – pythonic833 Feb 05 '23 at 10:42
  • I've updated my answer to refer only to a greedy policy and linked an answer where a step-by-step derivation is shown. Note that this step-by-step derivation mentions the optimal policy but doesn't assume any properties that are valid only for the optimal and not for the non-optimal case. – pythonic833 Feb 05 '23 at 11:02
  • You can use `\leq` or `\geq` instead of `<=` and `>=` ;) – nbro Feb 06 '23 at 11:42
  • @nbro, thanks and done. – pythonic833 Feb 06 '23 at 11:43