
I'm running the value iteration algorithm on a 4×3 gridworld with two terminal states with a -50 reward and one with a +20 reward, like so:

+--+--+-----+-----+
|  |  | -50 | +20 |
+--+--+-----+-----+
|  |  | -50 |     |
+--+--+-----+-----+
|  |  |     |     |
+--+--+-----+-----+

There are 4 possible actions: up, down, left and right.

There is some noise in the system: for any action, there is a 0.8 probability of moving in the intended direction, and a 0.1 probability each of slipping to the left or to the right of the intended direction (never backwards).

When the agent moves into a wall, it simply stays in place (with no negative reward).
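To make the dynamics concrete, here is a minimal Python sketch of that transition model (grid layout and the 0.8/0.1/0.1 slip as described above; all names are just illustrative):

```python
# Minimal sketch of the transition model described above (names are illustrative).
# Grid is 3 rows x 4 columns; terminals: two cells with -50, one with +20.

ROWS, COLS = 3, 4
TERMINALS = {(0, 2): -50, (0, 3): +20, (1, 2): -50}   # (row, col) -> reward
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
# Perpendicular "slip" directions for each intended action (never backwards).
PERPENDICULAR = {'up': ('left', 'right'), 'down': ('left', 'right'),
                 'left': ('up', 'down'), 'right': ('up', 'down')}

def move(state, direction):
    """Deterministic move; bumping into a wall leaves the agent in place."""
    r, c = state
    dr, dc = ACTIONS[direction]
    nr, nc = r + dr, c + dc
    if 0 <= nr < ROWS and 0 <= nc < COLS:
        return (nr, nc)
    return state  # hit a wall: stay put, no extra penalty

def transitions(state, action):
    """Return [(probability, next_state)] pairs for the 0.8 / 0.1 / 0.1 noise."""
    side_a, side_b = PERPENDICULAR[action]
    return [(0.8, move(state, action)),
            (0.1, move(state, side_a)),
            (0.1, move(state, side_b))]
```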

I want to compute the optimal policy and value function for this scenario using value iteration.
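That is, each sweep applies the standard Bellman optimality backup (written here with a state-based reward, which matches having rewards only on the terminal cells):

$$V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s') + \gamma\, V_k(s')\bigr]$$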

If I choose a discount factor of 1 (gamma=1), then my values all converge to 20 and my policy ends up like so:

       value table                 policy*
+----+----+-----+-----+  +-------+-------+------+-------+
| 20 | 20 | -50 | +20 |  | down  | left  | -50  | +20   |
+----+----+-----+-----+  +-------+-------+------+-------+
| 20 | 20 | -50 | 20  |  | down  | left  | -50  | right |
+----+----+-----+-----+  +-------+-------+------+-------+
| 20 | 20 | 20  | 20  |  | right | right | down | up    |
+----+----+-----+-----+  +-------+-------+------+-------+
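For concreteness, here is a minimal sketch of the value-iteration loop I mean, using the transition helpers above (the fixed terminal values, zero step reward and stopping threshold reflect the setup as described; tie-breaking in the policy is arbitrary):

```python
def value_iteration(gamma=1.0, theta=1e-6):
    """Plain value iteration; terminal values stay fixed at their rewards."""
    V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
    V.update(TERMINALS)  # terminals are absorbing, value = their reward
    while True:
        delta = 0.0
        new_V = dict(V)
        for s in V:
            if s in TERMINALS:
                continue
            # Bellman optimality backup; step reward is 0 in non-terminal cells
            new_V[s] = max(sum(p * gamma * V[s2] for p, s2 in transitions(s, a))
                           for a in ACTIONS)
            delta = max(delta, abs(new_V[s] - V[s]))
        V = new_V
        if delta < theta:
            break
    # Greedy policy w.r.t. the converged values (ties broken arbitrarily)
    policy = {s: max(ACTIONS,
                     key=lambda a: sum(p * V[s2] for p, s2 in transitions(s, a)))
              for s in V if s not in TERMINALS}
    return V, policy
```

With gamma=1 this is the setup that produces the all-20 value table shown above.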

Am I doing something wrong here? Is it normal for the values to converge to the maximum reward?

Will an MDP in a grid world always converge to the maximum reward if the discount factor is 1?

JeanMi
    I linked to a question that effectively includes a duplicate of yours. In brief, yes this is normal, and there are some considerations, such as how to convert the values to a policy. – Neil Slater May 30 '22 at 11:06
