Could anybody share their experience training a hierarchical DQN to play Montezuma's Revenge? How should I design the reward function, and how should I balance the annealing rates of the two levels?
I've been trying to train an agent to solve this game. The agent has 6 lives. Every time the agent fetches the key and then loses a life because of the instability or limited capability of the subgoal network, it restarts at the original location and simply goes through the door, which yields a huge reward. With an $\epsilon$-greedy rate of 0.1, the agent only occasionally chooses the key subgoal network and fetches the key, so it ends up always choosing the door as the subgoal.
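To make the exploration arithmetic concrete, here is a minimal sketch of $\epsilon$-greedy subgoal selection at the top level. The two-element subgoal set, the Q-values, and the names `select_subgoal` and `key_selection_rate` are hypothetical, and it assumes the random branch samples uniformly over all subgoals, including the greedy one. Under those assumptions, once the door has the higher Q-value, the key subgoal is picked with probability $\epsilon / 2$, i.e. about 5% of the time for $\epsilon = 0.1$.

```python
import random

SUBGOALS = ["key", "door"]  # hypothetical subgoal set, for illustration only


def select_subgoal(q_values, epsilon, rng):
    """Epsilon-greedy choice over subgoals given top-level Q-values."""
    if rng.random() < epsilon:
        # Exploration: uniform over ALL subgoals (assumption).
        return rng.choice(SUBGOALS)
    # Exploitation: subgoal with the highest estimated value.
    return max(SUBGOALS, key=lambda g: q_values[g])


def key_selection_rate(epsilon, trials=20_000, seed=0):
    """Empirical fraction of decisions that pick the key subgoal."""
    rng = random.Random(seed)
    q_values = {"key": 0.0, "door": 1.0}  # made-up values: door looks better
    picks = sum(
        select_subgoal(q_values, epsilon, rng) == "key" for _ in range(trials)
    )
    return picks / trials
```

Calling `key_selection_rate(0.1)` gives an empirical estimate of how rarely the key subgoal is tried under these assumptions.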
Could anyone show me how to train this agent in a one-life setting?
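For concreteness, the one-life setting I have in mind could be sketched as a wrapper that ends the episode as soon as a life is lost, so the agent can never restart while still holding the key. This is only a sketch under assumptions: the `reset()` / `step()` interface returning `(obs, reward, done, info)` with `info["lives"]`, and the names `LifeLossTerminal` and `StubEnv`, are hypothetical and not taken from any particular library.

```python
class LifeLossTerminal:
    """Treat any loss of life as the end of the episode (one-life setting).

    Assumes a hypothetical env with reset() -> obs and
    step(action) -> (obs, reward, done, info), where info["lives"]
    reports the number of remaining lives.
    """

    def __init__(self, env, start_lives=6):
        self.env = env
        self.start_lives = start_lives
        self.lives = start_lives

    def reset(self):
        # A full reset restores all lives (and, in the game, removes the key).
        self.lives = self.start_lives
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        lives = info.get("lives", self.lives)
        if lives < self.lives:
            done = True  # life lost: signal a terminal transition
        self.lives = lives
        return obs, reward, done, info


class StubEnv:
    """Toy stand-in for the game: action 1 costs a life, action 0 is safe."""

    def __init__(self, lives=6):
        self.start_lives = lives
        self.lives = lives

    def reset(self):
        self.lives = self.start_lives
        return 0

    def step(self, action):
        if action == 1:
            self.lives -= 1
        return 0, 0.0, self.lives == 0, {"lives": self.lives}
```

With `env = LifeLossTerminal(StubEnv())`, a step that costs a life returns `done=True`, and the training loop is expected to call `reset()` before continuing.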