
I'm new to Reinforcement Learning. For an internship, I am currently training an agent on Atari's "Montezuma's Revenge" using a double Deep Q-Network with Hindsight Experience Replay (HER) (see also this article).

HER is supposed to alleviate the reward sparseness problem. But since the reward is still extremely sparse, I have also added Random Network Distillation (RND) (see also this article) to encourage the agent to explore new states: it receives a higher reward when it reaches a previously undiscovered state and a lower reward when it reaches a state it has already visited many times. This is the intrinsic reward, which I add to the extrinsic reward the game itself gives. I have also used a decaying epsilon-greedy policy.
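For reference, this is roughly how I combine the two rewards (a minimal PyTorch sketch, not my exact code: the network sizes, the `beta` scale, and the per-step predictor update are placeholders/simplifications; in practice the observations and intrinsic rewards would also be normalised and the predictor trained on mini-batches):

```python
import torch
import torch.nn as nn

# RND: a fixed, randomly initialised target network and a trainable predictor.
# The intrinsic reward is the predictor's error on the current state:
# large for rarely-seen states, shrinking as states become familiar.
class RNDNet(nn.Module):
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, obs):
        return self.net(obs)

obs_dim = 128                       # placeholder observation size
target = RNDNet(obs_dim)
predictor = RNDNet(obs_dim)
for p in target.parameters():       # the target network is never trained
    p.requires_grad_(False)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)
beta = 0.5                          # intrinsic reward scale (hyperparameter)

def combined_reward(obs, extrinsic_reward):
    """Extrinsic game reward plus the RND novelty bonus."""
    obs = torch.as_tensor(obs, dtype=torch.float32)
    with torch.no_grad():
        target_feat = target(obs)
    pred_feat = predictor(obs)
    intrinsic = (pred_feat - target_feat).pow(2).mean()
    # Train the predictor so familiar states yield a smaller bonus over time.
    opt.zero_grad()
    intrinsic.backward()
    opt.step()
    return extrinsic_reward + beta * intrinsic.item()
```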

How well should this approach work? I ask because I've set it to run for 10,000 episodes, and the simulation is quite slow due to the mini-batch gradient descent step in HER. There are also multiple hyperparameters to tune. Before implementing RND, I considered reward shaping, but that is impractical in this case. What can I expect from my current approach? OpenAI's paper on RND reports excellent results on Montezuma's Revenge, but they used PPO rather than a DQN.
