
Let's assume we are in a $3 \times 3$ grid world with states numbered as $0,1, \dots, 8$. Suppose that the goal state is $8$, the reward of landing in the goal state is $10$, and the reward of just wandering around in the grid world is $0$. Is the state-value of state $8$ always $0$?

asked by Bhuwan Bhatt (edited by nbro)

1 Answer


In reinforcement learning, is the value of terminal/goal state always zero?

Yes: for episodic problems, the value of a terminal state is always zero, by definition.

The value of a state $v(s)$ is the expected sum (perhaps discounted) of rewards from all future time steps. There are no future time steps when in a terminal state, so this sum must be zero.

For the sake of consistent maths notation, you can consider a terminal state to be "absorbing", i.e. any transition out of it results in zero reward and returning to the same terminal state. Then you can use the definition of value function to show the same thing:

$$v_{\pi}(s) = \mathbb{E}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} | S_{t} = s]$$

If $s = s_T$, the terminal state, then all the "future rewards" from $k=0$ onwards, starting with $R_{t+1}$, must be zero. This is consistent with $R_{t}$, the reward received when transitioning *into* the terminal state, being any value.
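Spelled out, substituting the absorbing-state convention (every reward received after reaching $s_T$ is zero) into the definition above gives:

$$v_{\pi}(s_T) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_{t} = s_T\right] = \sum_{k=0}^{\infty} \gamma^k \cdot 0 = 0$$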

You can show the same thing using action-value functions, if you accept a "null" action in the terminal state.
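As a concrete check, here is a minimal value-iteration sketch for the $3 \times 3$ grid in the question (a hypothetical implementation, not from the original post), assuming deterministic up/down/left/right moves, a discount factor of $0.9$, and state $8$ treated as absorbing:

```python
# Sketch: value iteration on the 3x3 grid world from the question
# (states 0..8, reward 10 for entering goal state 8, 0 otherwise).
# The goal is absorbing: we never update it, so v(8) stays 0.

GAMMA = 0.9  # discount factor (an assumption; any gamma < 1 works)
GOAL = 8
N = 3        # grid side length

def neighbors(s):
    """States reachable in one step (up/down/left/right, staying put at walls)."""
    r, c = divmod(s, N)
    moves = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [nr * N + nc if 0 <= nr < N and 0 <= nc < N else s
            for nr, nc in moves]

def value_iteration(tol=1e-9):
    v = [0.0] * (N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s == GOAL:
                continue  # absorbing terminal state: v(GOAL) stays 0 by definition
            best = max((10.0 if s2 == GOAL else 0.0) + GAMMA * v[s2]
                       for s2 in neighbors(s))
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v

v = value_iteration()
print(v[GOAL])  # 0.0  -- the terminal state keeps value zero
print(v[5])     # 10.0 -- a state adjacent to the goal: reward 10 + gamma * v(8)
```

Note how the terminal state's value stays exactly zero, while the reward of $10$ for *entering* it shows up in the values of its neighbouring states, which is exactly the distinction the answer draws between $R_t$ and the return from $s_T$.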

answered by Neil Slater (edited by nbro)
  • To be clearer, the value of the terminal state is zero **by definition** of the terminal state. In theory, if you keep on receiving reward once you're in the terminal state, then the value won't be zero. – nbro Feb 07 '20 at 12:25
  • 1
    @nbro: If you keep on receiving reward once you're in a "terminal state" then it wasn't a terminal state. A terminal state is where the MDP stops. I think that is the more fundamental definition, the rest is consequences and book-keeping around that. The use of "absorbing" states is an example of that book keeping, and it is used so that some of the theory for continuous and episodic MDPs can use the same notation and derivations. – Neil Slater Feb 07 '20 at 13:56