0

I am running the A3C algorithm to evaluate a policy based on a policy gradient method. I observe an unexpected behavior at the start of the episode in the reward and episode length. As shown in the figure blew, at the start of the training the episode length increases and it starts to decrease after a peak and become normal behavior. For the reward, it decreases at the start and then increaes. May I ask what can cause this kind of unexpected behavior in the episode ?

enter image description here enter image description here

0 Answers0