
I'm currently training a DQN agent. I use an epsilon-greedy exploration strategy where I decay epsilon linearly to 0 over 300 episodes; for the remaining 50 episodes, epsilon stays at 0. Since the value is 0, I expected the agent to always select the same actions for the rest of training, but that does not seem to happen: the rewards of the last 50 episodes are not identical, as can be seen in the graph below. The blue curve is the reward for each episode and the orange one is the average reward over the last 100 episodes.

[Figure: per-episode reward (blue) and 100-episode average reward (orange)]

Is increasing the number of episodes after epsilon reaches 0 a correct solution for this? I have not tried it yet, as each episode takes approximately 20 seconds to complete, so training could become really long. Thanks in advance!
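
For reference, the schedule I'm describing is essentially the following (simplified sketch; the starting value of 1.0 is just for illustration):

```python
# Minimal sketch of the linear epsilon schedule described above
# (300 decay episodes and 50 final episodes as in the post; the
# starting value of 1.0 is illustrative).

def epsilon_for_episode(episode, start=1.0, decay_episodes=300):
    """Linearly decay epsilon from `start` to 0 over `decay_episodes`,
    then keep it at 0 for all remaining episodes."""
    if episode >= decay_episodes:
        return 0.0
    return start * (1.0 - episode / decay_episodes)

# Example: 350 total episodes, epsilon is 0 for the last 50.
schedule = [epsilon_for_episode(ep) for ep in range(350)]
print(schedule[0], schedule[150], schedule[299], schedule[300], schedule[349])
```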

gondorian

1 Answer


Annealing $\epsilon$ to 0 in $\epsilon$-greedy DQN is intended to reduce the exploration capability of the DQN agent, but it does not prevent the agent from continuing to learn. Typically, DQN incorporates a replay buffer that stores previous experience tuples of the form $(\mbox{state},\ \mbox{action},\ \mbox{reward},\ \mbox{next state})$. The agent is still able to store new experience tuples and learn from all stored experience tuples when $\epsilon=0$. As the agent continues to learn, its policy may change, and even a slight change in policy may alter its choice of actions; this phenomenon appears to be what your graph is showing.
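
To make this concrete, here is a toy sketch (not your code; the environment, hyperparameters, and a Q-table standing in for the Q-network are all made up for the demo) of a training loop with $\epsilon = 0$: action selection is purely greedy, but new transitions still enter the replay buffer and TD updates keep shifting the Q-values, so the greedy action for a state can change from episode to episode.

```python
# Illustrative sketch: even with epsilon = 0, the agent keeps appending
# transitions to its replay buffer and keeps running TD updates on sampled
# minibatches, so the greedy action for a state can still change.

import random

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
Q = rng.normal(scale=0.1, size=(n_states, n_actions))  # stand-in for the Q-network
replay_buffer = []
alpha, gamma, epsilon = 0.1, 0.99, 0.0  # epsilon already annealed to 0

def step(state, action):
    """Toy stochastic environment: noisy reward, random next state."""
    reward = rng.normal(loc=action, scale=1.0)
    return reward, int(rng.integers(n_states))

for episode in range(5):
    state = 0
    for t in range(20):
        # Greedy action selection (epsilon = 0, so no random exploration).
        action = int(np.argmax(Q[state]))
        reward, next_state = step(state, action)
        replay_buffer.append((state, action, reward, next_state))

        # Learning continues regardless of epsilon: sample a minibatch and
        # apply TD updates, which shift Q and can flip the greedy choice.
        batch = random.sample(replay_buffer, k=min(8, len(replay_buffer)))
        for s, a, r, s2 in batch:
            td_target = r + gamma * np.max(Q[s2])
            Q[s, a] += alpha * (td_target - Q[s, a])
        state = next_state

    print(f"episode {episode}: greedy action in state 0 = {int(np.argmax(Q[0]))}")
```

Running this a few times shows the per-episode behavior drifting even though no random actions are ever taken, which is exactly the kind of fluctuation your reward curve exhibits.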

To answer your question, most deep reinforcement learning (DRL) algorithms have numerous moving parts, and from what I've read in the literature, convergence to a fixed policy over consecutive episodes has not traditionally been a goal when training DRL algorithms. I personally would say that your chart shows fantastic results, as it exhibits great stability with negligible fluctuations in the reward graph when $\epsilon=0$. If you are absolutely determined to reach a fixed policy or reward graph during training, it would be helpful to know the specifics of the underlying environment, possibly by asking another question on this site with all of the details.

DeepQZero