I've built a deep deterministic policy gradient (DDPG) reinforcement learning agent that can handle games/tasks with only one action. However, the agent fails horribly when there are two or more actions. I looked online for examples of DDPG applied to a multi-action system, but people mostly apply it to the pendulum problem, which is a single-action problem.
My current system has 3 states and 2 continuous control actions (one adjusts the temperature of the system, the other adjusts a mechanical position; both are continuous). When I freeze the second continuous action at its optimal value, so the RL agent only has to manipulate one action, it solves the task within 30 episodes. The moment I let the agent control both continuous actions, it doesn't converge even after 1000 episodes; in fact, it diverges aggressively. The output of the actor network is almost always the max action, possibly because I am using a tanh activation on the actor to constrain the output. I added a penalty on large actions, but it does not seem to help in the 2-action case.
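For reference, the actor head is set up roughly like this (a simplified PyTorch sketch; the layer sizes and action bounds are placeholders, not my exact values):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: 3 states in, 2 bounded continuous actions out."""
    def __init__(self, state_dim=3, action_dim=2,
                 action_low=(-1.0, -1.0), action_high=(1.0, 1.0)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),  # squash to (-1, 1)
        )
        # Per-dimension rescaling so temperature and position each map to
        # their own physical range instead of sharing a single bound.
        low, high = torch.tensor(action_low), torch.tensor(action_high)
        self.register_buffer("scale", (high - low) / 2.0)
        self.register_buffer("bias", (high + low) / 2.0)

    def forward(self, state):
        return self.net(state) * self.scale + self.bias
```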
For exploration noise, I use Ornstein-Uhlenbeck noise, with the mean adjusted separately for each of the two continuous actions. The mean of the noise is set to 10% of the mean of the corresponding action.
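The noise process itself is roughly this (a minimal NumPy sketch; the mu/sigma values are illustrative, not my actual settings):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process over a 2-dimensional action vector."""
    def __init__(self, mu, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu = np.asarray(mu, dtype=np.float64)  # one mean per action
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.reset()

    def reset(self):
        self.x = self.mu.copy()

    def sample(self):
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.mu.shape))
        self.x = self.x + dx
        return self.x

# Means set to ~10% of each action's typical magnitude (numbers illustrative).
noise = OUNoise(mu=[0.1 * 50.0, 0.1 * 0.5])
noise.reset()
print(noise.sample())  # added to the actor output at each step during training
```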
Is there any massive difference between single-action and multi-action DDPG?
I changed the reward function to take both actions into account, tried a bigger network, tried prioritized experience replay, etc., but it appears I am missing something.
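For context, the action penalty term I mentioned above looks roughly like this (an illustrative sketch; the coefficient and per-action scales are placeholders):

```python
import numpy as np

def shaped_reward(base_reward, action, action_scale, penalty_coeff=0.01):
    """Subtract a quadratic penalty on large actions, normalized per
    dimension so temperature and position are penalized comparably."""
    normalized = np.asarray(action) / np.asarray(action_scale)
    return base_reward - penalty_coeff * float(np.sum(normalized ** 2))
```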
Does anyone here have experience building a multi-action DDPG who could give me some pointers?