
I've built a deep deterministic policy gradient (DDPG) reinforcement learning agent that can handle games/tasks with only one action. However, the agent seems to fail horribly when there are two or more actions. I looked online for examples of DDPG applied to a multi-action system, but people mostly applied it to the pendulum problem, which is a single-action problem.

My current system has 3 state variables and 2 continuous control actions: one adjusts the temperature of the system, the other adjusts a mechanical position. When I freeze the second action at its optimal value, so the RL agent only has to manipulate one action, it solves the task within 30 episodes. However, the moment I allow the agent to control both continuous actions, it doesn't converge even after 1000 episodes; in fact, it diverges aggressively. The output of the actor network always seems to be the max action, possibly because I am using a tanh activation on the actor output to enforce the action constraints. I added a penalty for large actions, but it does not seem to help in the 2-action case.
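For reference, this is roughly how I constrain the actor output; the framework, hidden layer sizes, and bound values below are just illustrative placeholders, not my exact implementation:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a 3-dimensional state to a 2-dimensional continuous action."""

    def __init__(self, state_dim=3, action_dim=2,
                 action_low=(-1.0, -1.0), action_high=(1.0, 1.0)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),  # output in (-1, 1)
        )
        low = torch.tensor(action_low)
        high = torch.tensor(action_high)
        # Per-dimension scale and offset so each action uses its own range.
        self.register_buffer("scale", (high - low) / 2.0)
        self.register_buffer("offset", (high + low) / 2.0)

    def forward(self, state):
        # Rescale the tanh output from (-1, 1) into [low, high] per dimension.
        return self.net(state) * self.scale + self.offset
```

Both action dimensions share the same tanh, but each one gets its own scale and offset, so the different ranges of the temperature and position actions are handled per dimension.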

For my exploration noise, I used Ornstein-Uhlenbeck noise, with the mean adjusted separately for each of the two continuous actions: the mean of the noise is 10% of the mean of the corresponding action.
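Roughly, the noise process looks like the sketch below (the $\theta$, $\sigma$, and $dt$ values here are placeholders, not my exact tuning); I add a sample of it to the actor output at every step and clip back into the action bounds:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process over a 2-dimensional action vector."""

    def __init__(self, mu, theta=0.15, sigma=0.2, dt=1e-2):
        # mu: per-dimension long-run mean of the noise
        # (in my case roughly 10% of the mean of each action).
        self.mu = np.asarray(mu, dtype=np.float64)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.reset()

    def reset(self):
        # Start each episode at the long-run mean.
        self.x = self.mu.copy()

    def sample(self):
        # Mean-reverting drift plus Gaussian diffusion.
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x

# Example: one noise mean per action dimension (values are placeholders).
noise = OUNoise(mu=[0.05, 0.1])
```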

Is there any massive difference between single-action and multi-action (i.e. multi-dimensional action) DDPG?

I changed the reward function to take both actions into account, tried a bigger network, tried prioritized experience replay, etc., but it appears I am missing something.

Does anyone here have any experience building a multiple-action DDPG and could give me some pointers?

  • Technically, the difference here is between actions in (some subset of) $\mathbb{R}$ and $\mathbb{R}^n$, not between 1 or more "actions". In other words, you have an action space that might have multiple dimensions, and something is going wrong for your agent when there are 2 or more dimensions. In RL, when something is described as "having 2 actions", this is usually an enumeration - i.e. the agent can take action A or action B, and there are no quantities involved. – Neil Slater Aug 25 '18 at 07:09
  • Hi Neil, thanks for the reply. Yes, in classic RL, agents' actions are indeed discrete. However, in 2015, Lillicrap et al. published a paper called "Continuous control with deep reinforcement learning", and algorithms such as TRPO and PPO were later designed to allow agents to perform multiple continuous actions. So you are correct that my actions live in a higher-dimensional space. In my research, I am comparing model predictive control using trajectory optimization vs AI-based control. Usually, in robotics and mechatronics, robots move multiple parts. I am trying to achieve that with RL. – Rui Nian Aug 26 '18 at 04:41
  • Current robots, such as the ones in the Tesla megafactory, have huge flaws in some tasks. I truly believe that with the new RL architectures, we can try to close these gaps. All input here is welcome, and thanks so much, guys, for all the help! – Rui Nian Aug 26 '18 at 04:43
  • I suggest you [edit] a more accurate description of your RL problem to replace the sentence "For my current system, it is a 3 state, 2 action system" - because that is not how it would be described in any literature. It may also be worth explaining how you have adjusted the exploration function ("actor noise"), as a mistake there would be key. – Neil Slater Aug 26 '18 at 09:16
  • Done! I will also try different exploratory noise means to see if it helps. – Rui Nian Aug 27 '18 at 03:46
  • Thanks. I was wondering if you had somehow failed to adjust for the different scales of the two axes of action, but it doesn't look like it. I cannot really tell what is wrong. However, I would not personally expect DDPG to be quite so fragile when scaling up from one to two dimensions of action, so I'd still suspect something about your implementation - I just don't know what it could be. – Neil Slater Aug 28 '18 at 07:41
  • No worries, Neil. I believe it has something to do with how difficult the mechanical movement is to learn. I will try new ways for my RL to interpret the actions and rewards. – Rui Nian Aug 28 '18 at 16:04

0 Answers