I'm running the value iteration algorithm on a 4x3 gridworld, with two terminal states of reward -50 and one of reward +20, like so:
+----+----+-----+-----+
|    |    | -50 | +20 |
+----+----+-----+-----+
|    |    | -50 |     |
+----+----+-----+-----+
|    |    |     |     |
+----+----+-----+-----+
There are 4 possible actions: up, down, left, and right.
There is some noise in the system: for any action, there is a 0.8 probability of actually executing it, and a 0.1 probability each of slipping perpendicular to it (to the left or to the right of the intended direction, never backwards).
When the agent would move into a wall, it simply stays in place (with no negative reward).
I want to compute the best policy and value function for this scenario using value iteration.
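
For reference, here is a minimal sketch of the kind of value iteration I have in mind (simplified, not my exact code). It assumes zero reward for every non-terminal transition and that the terminal reward is collected on entering the terminal state:

```python
ROWS, COLS = 3, 4
TERMINALS = {(0, 2): -50.0, (1, 2): -50.0, (0, 3): 20.0}   # (row, col): reward on entry
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# Noise: 0.8 for the intended move, 0.1 each for the two perpendicular slips.
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}


def outcomes(action):
    """[(actual_move, probability), ...] for an intended action."""
    a, b = PERPENDICULAR[action]
    return [(action, 0.8), (a, 0.1), (b, 0.1)]


def step(state, move):
    """Deterministic move; bumping into a wall leaves the agent in place."""
    r, c = state
    dr, dc = ACTIONS[move]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else state


def q_value(V, state, action, gamma):
    """One-step lookahead: expected entry reward plus discounted next value."""
    q = 0.0
    for move, p in outcomes(action):
        nxt = step(state, move)
        if nxt in TERMINALS:
            q += p * TERMINALS[nxt]        # terminal reward, episode ends
        else:
            q += p * gamma * V[nxt]        # zero living reward assumed
    return q


def value_iteration(gamma=1.0, tol=1e-6):
    """Synchronous value iteration over the non-terminal states."""
    states = [(r, c) for r in range(ROWS) for c in range(COLS)
              if (r, c) not in TERMINALS]
    V = {s: 0.0 for s in states}
    while True:
        new_V = {s: max(q_value(V, s, a, gamma) for a in ACTIONS) for s in states}
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V
```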
If I choose a discount factor of 1 (gamma=1), then my values all converge to 20 and my policy ends up like so:
value table              policy*
+----+----+-----+-----+ +-------+-------+------+-------+
| 20 | 20 | -50 | +20 | | down | left | -50 | +20 |
+----+----+-----+-----+ +-------+-------+------+-------+
| 20 | 20 | -50 | 20 | | down | left | -50 | right |
+----+----+-----+-----+ +-------+-------+------+-------+
| 20 | 20 | 20 | 20 | | right | right | down | up |
+----+----+-----+-----+ +-------+-------+------+-------+
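
The policy column is just the greedy argmax over the same one-step lookahead, read off the converged values; again only a sketch, reusing ACTIONS, q_value and value_iteration from the snippet above:

```python
def greedy_policy(V, gamma=1.0):
    """In each non-terminal state, pick the action with the highest Q-value."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a, gamma)) for s in V}


V = value_iteration(gamma=1.0)
print(greedy_policy(V))
```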
Am I doing something wrong here? Is it normal to converge to the max reward?
Will an MDP in a grid world always converge to the max reward if the discount factor is 1?