
In the 4th paragraph of http://www.incompleteideas.net/book/ebook/node37.html it is mentioned:

Whereas the optimal value functions for states and state-action pairs are unique for a given MDP, there can be many optimal policies

Could you please give me a simple example that shows different optimal policies considering a unique value function?

nbro
Melanie A

1 Answer

Consider a very simple grid-world, consisting of 4 cells, where an agent starts in the bottom-left corner, has actions to move North/East/South/West, and receives a reward $R = 1$ for reaching the top-right corner, which is a terminal state. We'll name the four cells $NW$, $NE$, $SW$ and $SE$ (for north-west, north-east, south-west and south-east). We'll take a discount factor $\gamma = 0.9$.

The initial position is $SW$, and the goal is $NE$, which an optimal policy should reach as quickly as possible. However, there are two optimal policies for the starting state $SW$: we can either go north first and then east (i.e., $SW \rightarrow NW \rightarrow NE$), or east first and then north (i.e., $SW \rightarrow SE \rightarrow NE$). Both of these policies are optimal: both reach the goal state in two steps and receive a return of $\gamma \times 1 = 0.9$. Yet they are clearly different policies, since they choose different actions in the initial state.

Note that my language was slightly informal above when talking about "policies for the starting state". Formally, I should have said that there are two optimal policies that select different actions in the starting state (and the same actions in all other states).
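To see this concretely, here is a minimal value-iteration sketch of the 4-cell grid-world described above (this code is my own illustration, not from the book; I assume deterministic moves and that moving into a wall leaves the agent in place). It shows that the optimal value function is unique, yet two greedy actions tie in $SW$:

```python
GAMMA = 0.9
CELLS = ["NW", "NE", "SW", "SE"]
TERMINAL = "NE"

# Deterministic transitions; moving into a wall keeps the agent in place
# (an assumption made for this illustration).
MOVES = {
    "NW": {"N": "NW", "E": "NE", "S": "SW", "W": "NW"},
    "SW": {"N": "NW", "E": "SE", "S": "SW", "W": "SW"},
    "SE": {"N": "NE", "E": "SE", "S": "SE", "W": "SW"},
}

def reward(next_cell):
    # R = 1 for reaching the terminal top-right corner, 0 otherwise
    return 1.0 if next_cell == TERMINAL else 0.0

# Value iteration until convergence
V = {c: 0.0 for c in CELLS}
for _ in range(100):
    for s in CELLS:
        if s == TERMINAL:
            continue
        V[s] = max(reward(s2) + GAMMA * V[s2] for s2 in MOVES[s].values())

# The optimal value function is unique:
print({s: round(V[s], 3) for s in CELLS})
# {'NW': 1.0, 'NE': 0.0, 'SW': 0.9, 'SE': 1.0}

# ...but in SW, both N and E achieve the optimal value, so any
# tie-breaking between them yields a (different) optimal policy:
greedy_SW = sorted(a for a, s2 in MOVES["SW"].items()
                   if abs(reward(s2) + GAMMA * V[s2] - V["SW"]) < 1e-9)
print(greedy_SW)
# ['E', 'N']
```

Both $N$ and $E$ are greedy with respect to the same unique $V^*$, which is exactly why the two policies above are both optimal.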

Dennis Soemers