While I've been able to solve MountainCar-v0 using Deep Q-learning, no matter what I try I can't solve this environment using policy-gradient approaches. As far as I've learned from searching the web, this is a really hard environment to solve, mainly because the agent only receives a reward when it reaches the goal, which is a rare event. I tried to apply so-called "reward engineering", more or less substituting the reward given by the environment with a reward based on the "energy" of the whole system (kinetic plus potential energy), but despite this, no luck (a rough sketch of what I tried is below the questions). I ask you:
- is it correct to assume that MountainCar-v0 is beyond the current state-of-the-art A3C algorithm, so that it requires some human intervention to suggest a policy to the agent, for example through reward engineering?
- could anyone provide a hint about which reward function could be used, provided that reward engineering is actually needed?
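
For reference, this is roughly the energy-based shaping I experimented with, written against the classic `gym` API (Gymnasium's `reset`/`step` signatures differ slightly). The gravity constant and the `sin(3 * position)` track shape mirror the MountainCar-v0 source code, but the scale factor and the choice to use the *change* in energy are just my own guesses:

```python
import gym
import numpy as np

# Gravity constant from the MountainCar-v0 source; the track height is
# proportional to sin(3 * position), which serves as the potential term.
GRAVITY = 0.0025

def mechanical_energy(state):
    """Kinetic plus potential energy of the car (mass taken as 1)."""
    position, velocity = state
    kinetic = 0.5 * velocity ** 2
    potential = GRAVITY * np.sin(3 * position)
    return kinetic + potential

def engineered_reward(state, next_state, scale=100.0):
    """Replace the sparse env reward with the change in total energy.

    The scale factor is arbitrary; without it the shaping term is tiny
    compared to the -1 per step the environment normally returns.
    """
    return scale * (mechanical_energy(next_state) - mechanical_energy(state))

env = gym.make("MountainCar-v0")
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # stand-in for the policy's action
    next_state, env_reward, done, _ = env.step(action)
    reward = engineered_reward(state, next_state)  # used instead of env_reward
    state = next_state
env.close()
```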
Thanks for your help.