
If the agent is following an $\epsilon$-greedy policy derived from Q, is there any advantage to decaying $\epsilon$ even though $\epsilon$ decay is not required for convergence?

KaneM

1 Answer


Yes, Q-learning benefits from decaying $\epsilon$ in at least two ways:

  • Early exploration. It makes little sense to closely follow whatever policy is implied by the randomly initialised network, and more will be learned about variation in the environment by starting with a random policy. It is fairly common in DQN to initially fill the experience replay table whilst using $\epsilon = 1.0$, or otherwise an effectively random policy (see the sketch after this list).

  • Later refinement. Q-learning can only learn from experiences it has. A behaviour policy that is too random may not visit enough near-optimal states to gather enough statistics on them to overcome variance. In more difficult cases, it may never experience the whole of an optimal trajectory, even when combining all the different transitions that it ever observes.
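As a concrete illustration, here is a minimal sketch of $\epsilon$-greedy action selection over a set of estimated action values. The `q_values` argument, and how you obtain it from the learned Q function, are assumptions for illustration only, not part of any particular DQN implementation:

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Return a random action with probability epsilon, else the greedy action.

    q_values: list of estimated action values for the current state,
              assumed to come from the learned Q function / network.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: uniform random action
    # exploit: pick the action with the highest estimated value
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With $\epsilon = 1.0$ this reduces to a purely random policy, which is what you would use to pre-fill the replay memory; as $\epsilon$ decays, the same routine gradually shifts towards the greedy target policy.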

In addition, when using a function approximator such as a neural network, predictions of the target policy will be influenced by the distribution of states and actions in the experience replay memory. If that distribution is biased towards the states visited by a very different behaviour policy, then basic Q-learning has no good way to adjust for it: it corrects for differences in expected return between behaviour and target policies, but not for differences in the distribution of observed states. This is still a somewhat open problem. You want the agent to learn from mistakes and imperfect behaviour so that it can avoid them, but you don't want those mistakes to skew predictions of what should happen under an optimal policy. Function approximators such as neural networks are influenced by the distribution of input data, so Q-learning typically performs better when the behaviour policy and target policy are close, e.g. $\epsilon$ should be relatively low if using $\epsilon$-greedy.

A typical implementation might start with $\epsilon = 1.0$, apply a decay factor per time step or per episode, e.g. $0.999$ per episode, and enforce a minimum $\epsilon = 0.01$. These are all hyper-parameters that you can adjust depending on the problem.
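A minimal sketch of such a schedule, using the example values above (the episode loop body and the training length are placeholders, not a specific implementation):

```python
num_episodes = 5000   # hypothetical training length
epsilon = 1.0         # start with an effectively random behaviour policy
epsilon_min = 0.01    # floor so the agent never stops exploring entirely
decay = 0.999         # multiplicative decay applied once per episode

for episode in range(num_episodes):
    # ... run one episode with epsilon-greedy behaviour and perform
    #     Q-learning updates here ...
    epsilon = max(epsilon_min, epsilon * decay)
```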

Neil Slater