5

Both value iteration and policy iteration are Generalized Policy Iteration (GPI) algorithms. However, they differ in the mechanics of their updates. Policy Iteration seeks to first find a complete value function for a policy, then derive the Q function from this and improve the policy greedily from this Q. Meanwhile, Value Iteration uses a truncated V function to then obtain Q updates, only returning the policy once V has converged.

What are the inherent advantages of using one over the other in a practical setting?

nbro
SeeDerekEngineer
    I think there's a typo here "V function to then obtain Q updates". What do you mean by "Q updates"? Maybe you mean "to obtain policy updates". – nbro Oct 11 '21 at 22:20

2 Answers

1

Value Iteration Vs Policy Iteration

Below is a list of the differences and similarities between value iteration and policy iteration.

Differences

|  | Value Iteration | Policy Iteration |
| --- | --- | --- |
| Architecture | (VI architecture diagram; see architecture reference) | (PI architecture diagram; see architecture reference) |
| Execution | starts with a random value function | starts with a random policy |
| Algorithm | simpler | more complex |
| Computation cost | more expensive | cheaper |
| Execution time | slower | faster |
| Iterations to converge | significantly more | fewer |
| Guaranteed to converge | yes | yes |

Similarities

  • Both are dynamic programming algorithms
  • Both employ variations of Bellman updates
  • Both exploit one-step look-ahead
  • Both are guaranteed to converge to an optimal policy in the end

Policy iteration is generally reported to converge in fewer iterations than value iteration.
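To make the comparison concrete, here is a rough Python/NumPy sketch of both algorithms on a toy tabular MDP (the arrays `P` and `R`, the parameter values, and the function names are made up purely for illustration, not taken from any of the references below). It shows that both methods repeat Bellman-style one-step look-ahead backups, while policy iteration inserts a full evaluation loop between greedy improvements:

```python
import numpy as np

# Toy MDP, purely illustrative: P[s, a, s'] transition probabilities, R[s, a] rewards
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.random((n_states, n_actions))

def policy_iteration(tol=1e-8):
    pi = np.zeros(n_states, dtype=int)                  # start from an arbitrary policy
    while True:
        # policy evaluation: iterate the Bellman expectation backup until V converges
        V = np.zeros(n_states)
        while True:
            V_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V
                              for s in range(n_states)])
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # policy improvement: act greedily w.r.t. the one-step look-ahead values
        Q = R + gamma * P @ V
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):                  # policy stable -> done
            return pi, V
        pi = new_pi

def value_iteration(tol=1e-8):
    V = np.zeros(n_states)                              # start from an arbitrary value function
    while True:
        # Bellman optimality backup (one-step look-ahead + max over actions)
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # the greedy policy is extracted only once, after V has converged
    return (R + gamma * P @ V).argmax(axis=1), V
```

Both functions should recover the same (optimal) greedy policy on this toy problem; policy iteration typically needs fewer outer iterations, but each outer iteration contains a (potentially long) inner evaluation loop, which is where the cost trade-off in the table above comes from.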

USAGE PREFERENCE

As noted in the differences above, the main advantage of policy iteration over value iteration is that it typically converges in fewer iterations, thereby reducing computation cost and execution time.

REFERENCES

  1. Research papers
  2. Book
    • Artificial Intelligence: A Modern Approach, by Stuart J. Russell and Peter Norvig, Chapter 17: Making Complex Decisions
  3. Architecture
Archana David
    The presentation/readability of this answer is good, but this does not answer the question "When to use one over the other?". Moreover, you do not explain 1. why value iteration is slower and more expensive than policy iteration and 2. who claimed that (i.e. cite something reliable). In any case, I don't understand why you say that policy iteration is "complex". Both algorithms are simple enough. – nbro Oct 12 '21 at 06:45
  • 2
    I added the notice because you're not really citing anything that supports the claim e.g. that policy iteration "takes fewer iteration to converge". Provide inline references/citations to research papers or books that support those claims (e.g. you can use this mathjax style `[\[1\][1]]` and it will be shown as `[1]`). Moreover, your last edit actually made things worse. For example, the diagram for value iteration is wrong. You're confused. Please, do not provide wrong information to the site. If you're not reasonably familiar with the concepts, don't provide an answer just by guessing. – nbro Dec 26 '21 at 18:52
  • 1
    @nbro: The claim that policy iteration is "complex" is made, though not explained, in [this tutorial](https://www.baeldung.com/cs/ml-value-iteration-vs-policy-iteration), on which this answer appears to be very closely based. – Scortchi - Reinstate Monica Dec 29 '21 at 14:09
  • 1
    @Scortchi-ReinstateMonica Yes, thanks. It looks like this answer is based on that tutorial. The initial table provided by the OP is really analogous to the one given in that tutorial. This answer is now more misleading than originally. In the last edit, the OP provided the diagram of "Value Iteration Network", which is not "value iteration". We may need to delete this answer to avoid spreading misinformation, although some info in this answer may be correct. More importantly, this does not even answer the question "when to use one over the other?". Not sure why it was accepted. – nbro Dec 29 '21 at 14:16
0

Value iteration (VI) is a truncated version of Policy iteration (PI).

PI has two steps:

  • The first step is policy evaluation, i.e., calculating the state values of a given policy. This step essentially solves the Bellman equation in its matrix-vector form $$v_\pi=r_\pi+\gamma P_\pi v_\pi$$ I assume that the basics are already known.
  • The second step is policy improvement, i.e., selecting, at each state, the action with the greatest action value (the greedy policy) with respect to the evaluated $v_\pi$: $$\pi_{\text{new}}=\arg\max_{\pi'}\left(r_{\pi'}+\gamma P_{\pi'} v_\pi\right)$$ (A small code sketch of both steps is given right after this list.)
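As a minimal sketch of these two steps, assuming a small tabular MDP stored in hypothetical NumPy arrays `r[s, a]` (rewards) and `P[s, a, s']` (transition probabilities), and using a direct linear solve for the evaluation step (the iterative scheme is discussed next):

```python
import numpy as np

def policy_iteration(r, P, gamma=0.9, max_iters=1000):
    # r[s, a]: expected immediate reward; P[s, a, t]: transition probability (hypothetical inputs)
    n_states, n_actions = r.shape
    pi = np.zeros(n_states, dtype=int)                 # arbitrary initial policy
    v_pi = np.zeros(n_states)
    for _ in range(max_iters):
        # Step 1: policy evaluation -- solve v_pi = r_pi + gamma * P_pi v_pi exactly
        r_pi = r[np.arange(n_states), pi]
        P_pi = P[np.arange(n_states), pi]
        v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Step 2: policy improvement -- greedy w.r.t. the one-step look-ahead action values
        q = r + gamma * P @ v_pi                       # shape (n_states, n_actions)
        new_pi = q.argmax(axis=1)
        if np.array_equal(new_pi, pi):                 # policy is stable, hence optimal
            break
        pi = new_pi
    return pi, v_pi
```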

The key point is: the policy evaluation step requires an infinite number of iterations to solve the Bellman equation exactly (i.e., to get the exact state value). In particular, we use the following iterative algorithm to solve the Bellman equation: $$v_\pi^{(k+1)}=r_\pi+\gamma P_\pi v_\pi^{(k)}, \quad k=1,2,\dots$$ We can prove that $v_\pi^{(k)}\rightarrow v_\pi$ as $k\rightarrow\infty$. There are three cases for executing this iterative algorithm:

  • Case 1: run an infinite number of iterations so that $v_\pi^{(\infty)}=v_\pi$. This is impossible in practice. Of course, in practice, we may run sufficiently many iterations until a certain metric (such as the difference between two consecutive value estimates) is small enough.
  • Case 2: run just a single iteration, so that $v_\pi^{(2)}$ is used for the policy improvement step.
  • Case 3: run a few iterations (say, $N$ of them), so that $v_\pi^{(N+1)}$ is used for the policy improvement step.

Case 1 is the policy iteration algorithm; case 2 is the value iteration algorithm; case 3 is a more general truncated version (truncated policy iteration). Such a truncated version does not require an infinite number of iterations and can converge faster than case 2, so it is often used in practice.
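To make the role of the truncation explicit, here is a rough sketch (with hypothetical NumPy arrays `r[s, a]` and `P[s, a, s']` as above; `N` is the number of evaluation sweeps per improvement step):

```python
import numpy as np

def truncated_policy_iteration(r, P, gamma, N, num_outer_iters=500):
    """r[s, a]: rewards, P[s, a, s']: transition probabilities (hypothetical inputs).
    N = 1 behaves like value iteration; a large N approximates policy iteration."""
    n_states = r.shape[0]
    v = np.zeros(n_states)
    pi = np.zeros(n_states, dtype=int)
    for _ in range(num_outer_iters):
        # policy improvement: greedy one-step look-ahead on the current v
        q = r + gamma * P @ v                          # shape (n_states, n_actions)
        pi = q.argmax(axis=1)
        # truncated policy evaluation: only N Bellman backups instead of running to convergence
        r_pi = r[np.arange(n_states), pi]
        P_pi = P[np.arange(n_states), pi]
        for _ in range(N):
            v = r_pi + gamma * P_pi @ v                # v^{(k+1)} = r_pi + gamma * P_pi v^{(k)}
    return pi, v
```

With $N=1$ each outer step performs exactly one Bellman optimality backup (since the evaluation uses the freshly improved greedy policy), which is value iteration; as $N\rightarrow\infty$ the inner loop converges to $v_\pi$ and the procedure becomes case 1, i.e., policy iteration.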

  • 1
    GPI is not a practical algorithm as suggested in this answer, but instead a principle behind the design of most value-based approaches. The attempt to define policy iteration as not used in practice, because no-one runs it in an infinite loop, is a bold move, and not something you would see in e.g. Sutton & Barto, where the issue is simply resolved by a hyperparameter $\theta$ that stops the evaluation loop. The resulting algorithm is still called Policy Iteration and *is* used in practice without the need to invoke GPI. – Neil Slater Oct 22 '21 at 14:22
  • @NeilSlater My answer suggests that GPI is more practical (instead of not practical as you commented) than PI and VI. If we stop at some iteration by whatever method, the algorithm falls into the scope of GPI. Even if it is called PI in this case by some people, it really depends on how people name these algorithms. –  Oct 23 '21 at 06:37
  • 1
    The GPI is not an actual algorithm that you can implement directly (with perhaps one or two free choices). Whilst PI and VI are. Of course you are free to categorise things how you like, but your choice here puts you at odds with the literature. – Neil Slater Oct 23 '21 at 07:45
  • For example of definition I am using see Sutton & Barto 2nd edition, section 4.6 – Neil Slater Oct 23 '21 at 09:32
  • @NeilSlater I respectfully disagree. All of VI, PI and GPI are actually algorithms that can be implemented. GPI can be viewed as a principle, but it is also a class of algorithms that execute a number of iterations (neither one nor sufficiently many to solve the Bellman equation exactly). We are both referring to the well-known book. I also read the section you mentioned. I don't see the book defining these concepts differently. Please clearly specify the mistakes if I made any. –  Oct 23 '21 at 13:08
  • 1
    @Shiyu The best way to prove your point would have been to quote the book. However, if you read [section 4.6 of the book we're talking about (2nd edition)](https://incompleteideas.net/book/RLbook2020.pdf#page=109), you will see that they write something that is inconsistent with what you wrote in this answer and that is consistent with what Neil wrote above. For completeness, here's what they write. – nbro Oct 23 '21 at 14:43
  • 1
    "_We use the term generalized policy iteration (GPI) to refer to **the general idea** of letting policy-evaluation and policy-improvement processes interact, independent of the granularity and other details of the two processes. **Almost all reinforcement learning methods are well described as GPI**. That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy, as suggested by the diagram to the right._" – nbro Oct 23 '21 at 14:44
  • 1
    So, GPI is not an algorithm. It's a general idea. You could say it's a class of algorithms, in the same way that "[best-first search](https://en.wikipedia.org/wiki/Best-first_search)" is not an algorithm, but a class of algorithms, where (specific) algorithms like A*, B* or greedy BFS fall into. – nbro Oct 23 '21 at 14:46
  • @Shiyu: Well GPI can be considered "implemented" in PI, VI, SARSA, Expected SARSA, Monte Carlo Control (online and offline), Q-Learning, DQN, DDQN etc. Do you think you could find an actual implementation of GPI as its own distinct algorithm though, in source code or pseudo-code? This seems to be what you are claiming. If so, please link it. You will find pseudocode implementations of many value-based algorithms in the S&B book (including an implementation of PI that does not have the problem you claim it has in "case 1" in this answer), but not any pseudocode for GPI . . . – Neil Slater Oct 23 '21 at 15:11
  • @nbro Thanks. You made your point very clear. Now I understand the difference between my view and yours and also Neil's. I focused more on the **narrow sense** of generalized policy iteration. That is, it is a generalized version of VI and PI. The **broad sense** of GPI is like all value-based RL algorithms. What I was referring to could be described more precisely as **truncated policy iteration**. I have revised my answer and please see if it could contribute as a helpful answer. –  Oct 23 '21 at 15:28
  • @NeilSlater With the help of nbro, I understand your point clearly now. However, as an experienced member, you really don't have to do this, especially to a newcomer who wishes to contribute to the forum. For example, your latest comment, especially its tone, does not help to resolve the discrepancy between us but rather leads to more arguments. –  Oct 23 '21 at 15:34
  • @NeilSlater I also revised my answer and please see if it could be a helpful answer to the other people now. –  Oct 23 '21 at 15:35
  • Thanks for editing your answer. I think now it's more correct, although I am not sure about the part "policy iteration step requires an infinite number of iterations to solve the Bellman equation". PI may converge in a finite number of iterations. But does that mean that the converged value function is exactly the same as the optimal one? Probably, not exactly the same (also because of numerical errors), but close enough (I guess that's what we mean by "convergence"). – nbro Oct 23 '21 at 15:45
  • 1
    By the way, I don't think that Neil was impolite. He was just trying to understand your point. Maybe you had seen GPI as a specific algorithm and he was trying to understand what you were referring to. I think we should take these comments as constructive, but I understand that it's easy to misunderstand/misinterpret a comment, and it's not always easy to express ourselves in the most appropriate way on the internet (I've also made this mistake in the past). – nbro Oct 23 '21 at 15:48
  • @nbro About the infinite number of iterations, I am talking about the math: the value converges to the true state value under the current policy as $k\rightarrow\infty$. I am more interested in the math behind RL. Of course, in practice, we say it converges when a certain metric is sufficiently small. I will try to make this point clearer. –  Oct 23 '21 at 16:04
  • @nbro About the comments, with this experience, I totally agree that we may not express ourselves as clearly as we think, and we may easily misinterpret the meaning of somebody else's comments. –  Oct 23 '21 at 16:06