
I do not understand how the equation below is obtained. That is, how is the first line obtained, and where is the information $v_{\pi}(x_k)=v_{\pi'}(x_k)$ used in the derivation involving the greedy policy?


\begin{align} v_{\pi'}(x_{k}) &= \max_{u_{k} \in U} \mathbb{E} [R_{k+1} + \gamma v_{\pi'}(X_{k+1})\,|\,X_{k}=x_{k}, U_{k}=u_{k}] \\ &= \max_{u_{k}} R^{u}_{x} + \gamma \sum_{x_{k+1} \in X} p^{u}_{xx'} v_{\pi'}(x_{k+1}) \end{align}

where $\pi^{\prime}(x_k)=\underset{u_k \in \mathcal{U}}{\operatorname{argmax}}\, q_{\pi}(x_k, u_k)$.


Note: the source of the above equation is slide 22 at the following link: https://groups.uni-paderborn.de/lea/share/lehre/reinforcementlearning/lecture_slides/built/Lecture03.pdf

  • Can you please put your **specific question** in the title? You can also use mathjax in the title – nbro Feb 03 '23 at 11:21
  • I did. I will be happy if you can provide an answer to this question as a top expert in RL. – DSPinfinity Feb 04 '23 at 13:06
  • I don't know why they didn't use the more common notation $s$ and $a$ to denote states and actions, respectively. Anyway, I'll check your question later (if I don't forget). – nbro Feb 04 '23 at 17:33

1 Answer


Policy iteration improves the policy step by step, so that $v_{\pi}(x_{k}) \leq v_{\pi'}(x_{k})$ for all $x_{k} \in X$. If $v_{\pi}(x_{k}) = v_{\pi'}(x_{k})$ for all $x_{k} \in X$, we do not need to iterate any further: the actions chosen by $\pi$ and $\pi'$ are of the same quality, so we have already found the optimal policy.

As you mentioned,
\begin{equation} \pi'(x_{k}) = \underset{u_{k}\in U}{\mathrm{argmax}}~q_{\pi}(x_{k}, u_{k}), \end{equation}
and, since we are using a greedy policy (see Sutton & Barto, eq. 3.18, for the last equality sign, and this answer for a step-by-step derivation),
\begin{equation} v_{\pi'}(x_{k}) = v_{\pi}(x_{k}) = \max_{u_{k}\in U} q_{\pi}(x_{k}, u_{k}). \end{equation}
From here it follows that
\begin{align} v_{\pi'}(x_{k}) = \max_{u_{k}\in U} q_{\pi}(x_{k}, u_{k}) = \max_{u_{k}} \mathbb{E} [R_{k+1} + \gamma v_{\pi}(X_{k+1})\,|\,X_{k} = x_{k}, U_{k} = u_{k}], \end{align}
and since $v_{\pi}(x_{k}) = v_{\pi'}(x_{k})$ holds for all $x_{k} \in X$, we can do the final step:
\begin{align} v_{\pi'}(x_{k}) = \max_{u_{k}\in U} q_{\pi}(x_{k}, u_{k}) = \max_{u_{k}} \mathbb{E} [R_{k+1} + \gamma v_{\pi'}(X_{k+1})\,|\,X_{k} = x_{k}, U_{k} = u_{k}]. \end{align}
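
To see how the stopping condition and the resulting Bellman optimality equation fit together, here is a minimal tabular policy-iteration sketch in Python. The random MDP, the names `P`, `R` and the helper `evaluate` are my own illustration and are not taken from the slides:

```python
import numpy as np

# Purely illustrative MDP (not from the slides): transition tensor P[u, x, x']
# and expected-reward matrix R[x, u] for a small tabular problem.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # each row is a probability distribution
R = rng.random((n_states, n_actions))

def evaluate(policy):
    """Solve v = R_pi + gamma * P_pi v exactly for a deterministic policy."""
    P_pi = np.stack([P[policy[x], x] for x in range(n_states)])
    R_pi = R[np.arange(n_states), policy]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

policy = np.zeros(n_states, dtype=int)
while True:
    v = evaluate(policy)                              # v_pi
    q = R + gamma * np.einsum("uxy,y->xu", P, v)      # q_pi(x_k, u_k)
    new_policy = q.argmax(axis=1)                     # greedy policy pi'
    if np.allclose(evaluate(new_policy), v):          # v_pi'(x_k) == v_pi(x_k) for all x_k
        break                                         # no further improvement possible
    policy = new_policy

# At this point v satisfies v(x_k) = max_u q_pi(x_k, u), i.e. the Bellman
# optimality equation from the last line above holds.
assert np.allclose(v, q.max(axis=1))
```

The loop stops exactly when the greedy policy no longer changes the value function, and that is the moment at which the substitution $v_{\pi}(x_{k}) = v_{\pi'}(x_{k})$ in the final step becomes valid.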

Edit: The following is the answer I posted to an older version of the question, where it was not clear that the OP wanted to know how the first line is derived; I had assumed instead that the simplification between the first and second line was what needed explaining.
The second line in your equation is just the matrix notation (see slide 9 in your document) of the first one and shows how to build the expectation value:
\begin{align} v_{\pi'}(x_{k}) &= \max\limits_{u_{k} \in U} \mathbb{E} [R_{k+1} + \gamma v_{\pi'}(X_{k+1})\,|\,X_{k}=x_{k}, U_{k}=u_{k}] \\ &= \max\limits_{u_{k}} R^{u}_{x} + \gamma \mathbb{E} [v_{\pi'}(X_{k+1})\,|\,X_{k}=x_{k}, U_{k}=u_{k}] \\ &= \max\limits_{u_{k}} R^{u}_{x} + \gamma \sum_{x_{k+1} \in X}p(x_{k+1} | x_{k}, u_{k})\, v_{\pi'}(x_{k+1}) \\ &=\max\limits_{u_{k}} R^{u}_{x} + \gamma \sum_{x_{k+1} \in X}p^{u}_{xx'}\, v_{\pi'}(x_{k+1}) \end{align}

In the first step, we take the expectation over the reward and move $\gamma$ out of the expectation value. In the second step, we write out the remaining expectation explicitly, i.e., we weight the value function of each possible successor state $x_{k+1}$ by the probability of ending up there. Finally, we change the notation to match the standard notation provided by the OP.
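
As a quick numerical sanity check of this weighted sum, here is a tiny sketch; the transition probabilities and values below are invented purely for illustration:

```python
import numpy as np

# Toy numbers (not from the slides): from state x_k, action u_k yields an
# expected immediate reward R^u_x and three possible successor states.
gamma = 0.9
R_x_u = 1.0                               # R^u_x, expected immediate reward
p_next = np.array([0.2, 0.5, 0.3])        # p(x_{k+1} | x_k, u_k)
v_next = np.array([4.0, 2.0, 0.0])        # v_pi'(x_{k+1}) for each successor

# E[R_{k+1} + gamma * v_pi'(X_{k+1}) | x_k, u_k]
#   = R^u_x + gamma * sum_{x'} p(x' | x_k, u_k) * v_pi'(x')
expectation = R_x_u + gamma * np.dot(p_next, v_next)
print(expectation)   # 1.0 + 0.9 * (0.2*4 + 0.5*2 + 0.3*0) = 2.62
```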

  • I updated the question with the points that are not clear to me. Could you please update your answer as well? – DSPinfinity Feb 03 '23 at 05:34
  • @DSPinfinity Thanks for the clarification, I updated the answer – pythonic833 Feb 04 '23 at 23:03
  • Thank you for your efforts, but I think the answer is still not correct. In your answer you say "since we have found the optimal policy", but that is what you want to show! – DSPinfinity Feb 05 '23 at 10:33
  • So first, if you are using a procedure that leads to the optimal policy and your policy converges, i.e. $v_{\pi}(x) = v_{\pi'}(x)$ for all $x$, then you have found the optimal policy. Second, this equation holds even for a non-optimal but greedy, deterministic policy. – pythonic833 Feb 05 '23 at 10:42
  • I've updated my answer to refer only to a greedy policy and linked an answer where a step-by-step derivation is shown. Note that this step-by-step derivation mentions the optimal policy but doesn't assume any properties that are valid only for the optimal and not for the non-optimal case. – pythonic833 Feb 05 '23 at 11:02
  • You can use `\leq` or `\geq` instead of `<=` and `>=` ;) – nbro Feb 06 '23 at 11:42
  • @nbro, thanks and done. – pythonic833 Feb 06 '23 at 11:43