
The basis of Q-learning is recursive (similar to dynamic programming), where only the value of the terminal state is known absolutely.

Wouldn't it make sense to feed the model a greater proportion of terminal states initially, to ensure that the predicted value of terminal states (zero) is learned first?

Will this make the network more likely to converge to the global optimum?

sloth

2 Answers


The basis of Q-learning is recursive (similar to dynamic programming), where only the value of the terminal state is known absolutely.

This may be true in some environments. Many environments do not have a terminal state at all; they are continuing (non-episodic). Your statement may be true, for instance, in a board-game environment where the goal is to win, but it is false for, e.g., the Atari games environment.

In addition, the value of the terminal state is always zero by definition, so a special hard-coded $0$ is often used and the neural network is not required to learn it. So it is only for deterministic transitions $(S,A) \rightarrow (R, S^T)$ that you need to learn that $Q(S,A) = R$ absolutely.
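
As a concrete illustration (a minimal NumPy sketch; the function and argument names here are my own, not from this answer), the usual one-step target computation hard-codes the terminal value by masking out the bootstrap term whenever a transition ends the episode:

```python
import numpy as np

def one_step_targets(rewards, next_q_max, dones, gamma=0.99):
    """One-step TD targets for a batch of transitions.

    For a transition into a terminal state (done == True) the value of the
    terminal state is hard-coded to 0, so the target is exactly the reward R;
    the network never has to learn the terminal value itself.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q_max = np.asarray(next_q_max, dtype=float)
    not_done = 1.0 - np.asarray(dones, dtype=float)
    return rewards + gamma * not_done * next_q_max

# The second transition is terminal, so its target is exactly R = 1.0
print(one_step_targets([0.0, 1.0], [2.5, 9.9], [False, True]))  # [2.475, 1.0]
```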

Wouldn't it make sense to feed the model a greater proportion of terminal states initially, to ensure that the predicted value of terminal states (zero) is learned first?

In a situation where you have a terminal state and know it, then yes, this could help a little. It will help most where the terminal state is also a "goal" state, which does not have to be the case, even in episodic problems. For instance, in a treasure-collecting maze where the episode ends after a fixed time, knowing the value of the terminal state and of transitions close to it matters less for optimal control than establishing the expected return for earlier parts of the path.

Focusing on "goal" states does not generalise to all environments, and it is of minimal help once the network has approximated Q values close to terminal and/or goal states. There are more generic approaches than your suggestion for distributing knowledge of sparse rewards, including episode termination:

  • Prioritised sweeping. This generalises your idea of selectively sampling where experience shows that there is knowledge to be gained (by tracking current error values and transitions); a sketch of this kind of error-driven sampling follows this list.

  • n-step temporal difference learning. Using longer trajectories to calculate TD targets increases variance but reduces bias, and allows reward to be assigned across multiple steps quickly. This is extended in TD($\lambda$) to allow parametric mixing of trajectories of multiple lengths, and can be done online using eligibility traces. Combining Q($\lambda$) with deep neural networks is possible; see this paper for an example.
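
To make the first bullet concrete, here is a minimal sketch of error-driven sampling in the style of prioritised experience replay (a close deep-RL relative of prioritised sweeping; the names and the exponent alpha are assumptions for illustration). Transitions whose current TD error is largest, which early in training are often those near terminal or goal states, get sampled most often:

```python
import numpy as np

rng = np.random.default_rng(0)

def prioritised_indices(td_errors, batch_size, alpha=0.6, eps=1e-6):
    """Sample replay-buffer indices with probability proportional to |TD error|^alpha.

    This generalises "oversample terminal transitions": any transition whose
    current value estimate is badly wrong is revisited more often, with no
    need to hand-pick terminal states.
    """
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    return rng.choice(len(td_errors), size=batch_size, p=probs)

# Transition 2 has the largest current error, so it dominates the samples
errors = np.array([0.1, 0.05, 2.0, 0.3])
print(np.bincount(prioritised_indices(errors, batch_size=1000), minlength=4))
```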

Neil Slater

If you have enough domain knowledge to be able to reliably, intentionally reach those terminal states often when generating experience, yeah, that could help.

Generally, the assumption in Reinforcement Learning is that we have no domain knowledge beyond knowing that we are in a Markov Decision Process. This means we start learning from scratch, and before extensive learning we do not know how to reach terminal states. If we don't know how to reach terminal states, we also can't deliberately go to them to generate the experiences we want, as you suggest.

Dennis Soemers
  • I was thinking more along the lines of indexing terminal states separately and adding them to the batch based on some probability μ. – sloth May 02 '18 at 15:16
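
A rough sketch of the scheme described in this comment (the buffer names, the mixing probability mu, and the per-element mixing rule are my own assumptions, not something from the thread):

```python
import random

def mixed_batch(regular_buffer, terminal_buffer, batch_size, mu=0.2):
    """Build a minibatch in which each element is drawn from the buffer of
    terminal transitions with probability mu, and from the regular replay
    buffer otherwise (assumes regular_buffer is non-empty)."""
    batch = []
    for _ in range(batch_size):
        if terminal_buffer and random.random() < mu:
            batch.append(random.choice(terminal_buffer))
        else:
            batch.append(random.choice(regular_buffer))
    return batch
```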