
I am referring to the Value Iteration (VI) algorithm as presented in Sutton and Barto's book.

Rather than getting the greedy deterministic policy after VI converges, what happens if we try to obtain the greedy policy while looping through the states (i.e. using the argmax equation inside the loop)? Once our $\Delta < \theta$ and we break out of the loop, do we have an optimal policy from the family of optimal policies? Is this a valid thing to do?

I implemented the gambler's problem exercise mentioned in Sutton's book. The policies obtained after using standard VI and the method I described above are mostly similar, yet different for some states.

  • Are you suggesting to replace all instances of $V(s)$ with $\pi(s)$, including an update rule $\pi(s) \leftarrow \text{argmax}_a \sum_{s',r} p(s',r|s,a)[r +\gamma \pi(s')]$ ? – Neil Slater Jun 28 '21 at 19:31
  • No. After we get the new $V(s)$, we calculate $\pi(s) \leftarrow \text{argmax}_a \sum_{s',r} p(s',r|s,a)[r +\gamma V(s')]$, rather than getting optimal $\pi(s)$ outside the loop. – user529295 Jun 28 '21 at 19:35
  • Ah, OK. That is simpler – Neil Slater Jun 28 '21 at 19:36
  • Will that have any effect? Is it a valid thing to do? – user529295 Jun 28 '21 at 19:38
  • Hello. Please, do not roll back my edit. It improves your post by specifying your actual question in the title. See [Why can people edit my posts? How does editing work?](https://ai.stackexchange.com/help/editing). In particular, I would like you to notice the sentence "_Editing is important for keeping questions and answers clear, relevant, and up-to-date. If you are not comfortable with the idea of your contributions being collaboratively edited by other trusted users, this may not be the site for you._". – nbro Oct 03 '21 at 23:53

1 Answer


My understanding is that you want to extend the main update loop like this:

$\qquad V(s) \leftarrow \text{max}_a \sum_{r,s'} p(r,s'|s,a)[r+\gamma V(s')]$

$\qquad \pi(s) \leftarrow \text{argmax}_a \sum_{r,s'} p(r,s'|s,a)[r+\gamma V(s')]$

where the first line is the same as in the original algorithm, and the second line is taken from the final step that extracts the optimal policy after convergence.
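To make that concrete, here is a minimal Python sketch of one value-iteration sweep with the greedy action recorded inside the loop. The `states`, `actions` and `transitions` helpers are hypothetical placeholders for whatever tabular model you already have (with `transitions(s, a)` yielding `(prob, next_state, reward)` triples), not anything taken from the book:

```python
import numpy as np

def value_iteration_inline_policy(states, actions, transitions, gamma=1.0, theta=1e-9):
    # V(s) initialised to 0; terminal states are simply absent from `states`
    # and contribute V = 0 via the .get() below.
    V = {s: 0.0 for s in states}
    pi = {}
    while True:
        delta = 0.0
        for s in states:
            acts = actions(s)
            q = [sum(p * (r + gamma * V.get(s2, 0.0))
                     for p, s2, r in transitions(s, a))
                 for a in acts]
            best = int(np.argmax(q))
            delta = max(delta, abs(q[best] - V[s]))
            V[s] = q[best]            # usual VI update: max over actions
            pi[s] = acts[best]        # extra line: greedy policy updated every sweep
        if delta < theta:
            return V, pi
```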

> Once our $\Delta < \theta$ and we break out of the loop, do we have an optimal policy from the family of optimal policies? Is this a valid thing to do?

Yes.

The only problem with this approach is the extra wasted processing - you will evaluate and overwrite $\pi$ once per loop. Doing it once at the end, after convergence, is more efficient.
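For comparison, the book's version would do that extraction just once, in a single pass after the value loop has terminated. A sketch using the same hypothetical helpers as above:

```python
def greedy_policy(V, states, actions, transitions, gamma=1.0):
    # Single pass over the states once V has converged.
    pi = {}
    for s in states:
        q = {a: sum(p * (r + gamma * V.get(s2, 0.0))
                    for p, s2, r in transitions(s, a))
             for a in actions(s)}
        pi[s] = max(q, key=q.get)     # argmax over actions, computed only here
    return pi
```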

> The policies obtained after using standard VI and the method I described above are mostly similar, yet different for some states.

There are lots of equivalent optimal policies in the gambler's problem exercise in the book. This is expected. If you re-run the normal value iteration routine you should see similar changes to the policy.

What the policies have in common, if you plot bid amount against current funds, is a kind of fractal sawtooth shape: triangles that try to drive funds towards fixed points, and spikes where half or all of the funds are bet in a single go to reach 100 funds. It might be one large triangle, or it might be broken into several smaller ones separated by larger bid spikes. That is the expected shape, unless the probability of a win is greater than 0.5, in which case the optimal policy is to bet the minimal amount on each time step and rely on the law of averages to reach the target almost surely. If you set the win probability to, say, 0.6, both of your scripts should return this same "safe" policy.
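If you want to reproduce that last point, here is a rough sketch of the gambler's-problem dynamics from Example 4.3 (capital $1,\dots,99$, stakes $1,\dots,\min(s, 100-s)$, win probability $p_h$, reward $+1$ only on reaching the goal, $\gamma = 1$), written against the same hypothetical helpers as the sketches above. Note that ties in the argmax are common in this problem, and different tie-breaking is one reason two otherwise correct implementations can print slightly different policies:

```python
P_H = 0.6      # win probability; anything > 0.5 should give the minimal-bet "safe" policy
GOAL = 100

def gambler_states():
    return list(range(1, GOAL))                    # capital 1..99; 0 and 100 are terminal

def gambler_actions(s):
    return list(range(1, min(s, GOAL - s) + 1))    # stakes available with capital s

def gambler_transitions(s, a):
    # Win: capital goes up by the stake; reward +1 only if that reaches the goal.
    yield (P_H, s + a, 1.0 if s + a == GOAL else 0.0)
    # Lose: capital goes down by the stake; no reward.
    yield (1.0 - P_H, s - a, 0.0)

# V, pi = value_iteration_inline_policy(gambler_states(), gambler_actions,
#                                       gambler_transitions, gamma=1.0)
```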

Neil Slater
  • In some cases, if we overwrite $\pi$ and it turns out to be the same after two consecutive sweeps, then we can exit the loop early and wouldn't have to wait for $\Delta$ to drop below $\theta$, right? – user529295 Jun 28 '21 at 20:40
  • 1
    @user529295 No I don't think so. The policy may shift with small changes to value function, and may remain the same over some iterations. I think to guarantee an optimal policy you have to run value iteration until the values converge. – Neil Slater Jun 28 '21 at 20:49