
Let's say I'm training a reinforcement learning agent to act in an environment that perpetually gives the agent opportunities to earn rewards, with no cap on the score and no way to "win". That is, there is no natural "end" to an episode.

In these scenarios, is the choice of episode length completely arbitrary? Is it easier to train on shorter episodes than longer ones?

nbro
Vladimir Belik

1 Answer


There are a few factors that affect the ideal pseudo-episode length when learning in a continuing (non-episodic) environment:

  • Start state. The start state of a continuing environment may be special in some way and unreachable later on. It is still important to learn optimal behaviour there, and some early choices may even be critical in order to reach the best areas of the state space. If there is a standard/fixed start state, you may want to reset and restart pseudo-episodes more frequently.

  • Statistical coverage of repeating states. In a continuing environment under a fixed policy, there should be a set of ergodic states: states that are visited at some expected long-term frequency (or frequency density, for non-discrete state spaces) due to the combined behaviour of the policy and the environment transitions. You want to sample from this set fairly, in an unbiased manner, in order to accurately learn expected values. This means you may want to reset and restart pseudo-episodes less frequently, so that long trajectories reduce the impact of the start state distribution on the relative frequency of each state in this repeating set (the toy sketch after this list illustrates the bias).
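
To make the coverage point concrete, here is a small toy sketch. The two-state Markov chain and its transition matrix are made up purely for illustration, standing in for "environment plus fixed policy": short pseudo-episodes that always restart from a fixed start state over-sample that state relative to the long-run (stationary) visitation frequencies, while longer pseudo-episodes get much closer to them.

```python
import numpy as np

# 2-state Markov chain as a stand-in for "environment + fixed policy".
# Its stationary (long-run) distribution is [2/3, 1/3].
P = np.array([[0.9, 0.1],   # from state 0: mostly stay in state 0
              [0.2, 0.8]])  # from state 1: mostly stay in state 1
START_STATE = 0
rng = np.random.default_rng(0)

def visit_frequencies(episode_length, num_episodes):
    """Empirical state-visitation frequencies over restarted pseudo-episodes."""
    counts = np.zeros(2)
    for _ in range(num_episodes):
        s = START_STATE
        for _ in range(episode_length):
            counts[s] += 1
            s = rng.choice(2, p=P[s])
    return counts / counts.sum()

# Same total number of sampled steps (100,000) in both cases.
print(visit_frequencies(episode_length=5, num_episodes=20_000))  # biased towards state 0
print(visit_frequencies(episode_length=500, num_episodes=200))   # close to [0.667, 0.333]
```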

These two issues are at odds with each other, so the pseudo-episode length (i.e. when to artificially stop and restart) is a hyperparameter that may require experimentation. It also interacts with other hyperparameters, such as the discount factor: a small discount factor may mean you care less about fair sampling of the continuing states, although you will still care about visiting enough of them to get estimates for all reachable states.
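
In practice, the pseudo-episode length usually shows up as a truncation limit in the training loop. The sketch below assumes hypothetical `env` and `agent` objects rather than any particular library's API; the one important detail is that the cut at the end of a pseudo-episode is a truncation, not a true terminal state, so the update should keep bootstrapping from the next state instead of treating its value as zero.

```python
# Minimal sketch of truncating a continuing task into pseudo-episodes.
# `env` and `agent` are hypothetical placeholders, not a specific library API:
#   env.reset()        -> start state
#   env.step(action)   -> (next_state, reward)   (the task never terminates)
#   agent.act(state)   -> action
#   agent.update(...)  -> one TD-style learning step

PSEUDO_EPISODE_LENGTH = 1_000   # the hyperparameter discussed above
GAMMA = 0.99                    # discount factor; interacts with the length choice

def train(env, agent, num_pseudo_episodes):
    for _ in range(num_pseudo_episodes):
        state = env.reset()     # back to the (possibly special) start state
        for _ in range(PSEUDO_EPISODE_LENGTH):
            action = agent.act(state)
            next_state, reward = env.step(action)
            # Truncation, not termination: the environment continues past the
            # cut, so keep bootstrapping from next_state rather than using a
            # zero value target at the pseudo-episode boundary.
            agent.update(state, action, reward, next_state,
                         gamma=GAMMA, terminal=False)
            state = next_state
```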

There can be other issues that affect your choice, such as the RL method you are using, whether you are using function approximation, and whether there are some form of "attractors" that tend to get the agent stuck in inescapable loops. If you know your environment has these kinds of trap-state loops, you may want to treat that as similar to having a special start state, and reset more often.

In these scenarios, is the choice of episode length completely arbitrary?

No, it is a hyperparameter that you may be able to take an educated guess at or figure out from observing behaviour.

Is it easier to train on shorter episodes than longer ones?

It depends on the environment, and there is no generally applicable preference for short or long pseudo-episodes.

Neil Slater