
While I've been able to solve MountainCar-v0 using Deep Q-learning, no matter what I try I can't solve this environment using policy-gradient approaches. As far as I've learned from searching the web, this is a really hard environment to solve, mainly because the agent is given a reward only when it reaches the goal, which is a rare event. I tried to apply so-called "reward engineering", more or less substituting the reward given by the environment with a reward based on the "energy" of the whole system (kinetic plus potential energy; a rough sketch of what I mean is below the questions), but despite this, no luck. My questions are:

  • is it correct to assume that MountainCar-v0 is beyond the current state-of-the-art A3C algorithm, so that it requires some human intervention to suggest a policy to the agent, for example through reward engineering?
  • could anyone provide a hint about which reward function could be used, provided that reward engineering is actually needed?
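For reference, the kind of energy-based reward I mean looks roughly like this (just a sketch, not my exact code; the sin(3·position) hill shape comes from the environment's source, the scaling constant is arbitrary, and the classic four-value Gym step API is assumed):

```python
import gym
import numpy as np

env = gym.make("MountainCar-v0")

state = env.reset()
done = False
while not done:
    action = env.action_space.sample()          # placeholder for the policy's action
    next_state, _, done, _ = env.step(action)   # the env's own reward is discarded
    position, velocity = next_state             # observation is [position, velocity]
    kinetic = 0.5 * velocity ** 2
    potential = 0.0025 * np.sin(3 * position)   # height ~ sin(3x); constant is arbitrary
    reward = kinetic + potential                # shaped reward replaces the env's -1 per step
    state = next_state
```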

Thanks for your help.

Scorpio76

1 Answer


I don't know about your first question, but I got a basic policy gradient approach with the kinetic energy as reward working on MountainCar-v0.

You can implement it based on this blog and the notebook you find there. It uses an MLP with one hidden layer of size 128 and standard policy gradient learning.

The reward engineering boils down to replacing the reward variable with the kinetic energy $v^2$ (no potential energy and no constant factor; the environment's own reward is not used). It takes $>1000$ episodes to solve the environment consistently.
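Roughly, the setup looks like this (a sketch only: the single hidden layer of 128 units is as described above, but the optimizer, learning rate, discount factor, and the older four-value Gym step API are assumptions rather than the notebook's exact choices):

```python
import gym
import torch
import torch.nn as nn

env = gym.make("MountainCar-v0")

# Policy: one hidden layer of 128 units, softmax over the 3 discrete actions.
policy = nn.Sequential(
    nn.Linear(2, 128),   # observation: [position, velocity]
    nn.ReLU(),
    nn.Linear(128, 3),
    nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)  # assumed hyperparameters
gamma = 0.99

for episode in range(2000):
    state = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        next_state, _, done, _ = env.step(action.item())

        # Reward engineering: ignore the env's -1 per step and use the
        # squared velocity (kinetic energy up to a constant) instead.
        reward = next_state[1] ** 2

        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
        state = next_state

    # Discounted returns, then the standard policy-gradient loss.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```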

I'm afraid the solution is not very satisfactory, and I don't have the feeling there is much to learn from it. The code is originally for the CartPole problem, and it stops working for me if I change the hyperparameters, the optimizer, or the specifics of the reward.

christoph
  • Thanks for your reply, Christoph. At the very end, I was able to solve the problem using a custom reward (like the one you suggested). I think that without a reward providing the agent some "hint" about the quality of its policy, the original problem could be solved only if the agent reaches the top of the hill basically by chance. – Scorpio76 Oct 07 '19 at 06:03