
I am working on a scheduling problem that has inherent randomness. The dimensions of the action and state spaces are 1 and 5, respectively.

I am using DDPG, but it seems extremely unstable, and so far it isn't showing much learning. I've tried to

  1. adjust the learning rate,
  2. clip the gradients,
  3. change the size of the replay buffer,
  4. try different neural net architectures, using both SGD and Adam,
  5. change the $\tau$ for the soft update (sketched below).
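
A minimal PyTorch-style sketch of what I mean by the soft update, $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ (the names here are just illustrative, not my actual code):

```python
import torch

def soft_update(target_net, source_net, tau):
    """Polyak-average the source parameters into the target network:
    theta_target <- tau * theta_source + (1 - tau) * theta_target."""
    with torch.no_grad():
        for target_param, source_param in zip(target_net.parameters(),
                                              source_net.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * source_param)
```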

So, I'd like to know what people's experience is with this algorithm, both for the environments it was tested on in the paper and for other environments. What hyperparameter values worked for you? Or what else did you do? How cumbersome was the fine-tuning?

I don't think my implementation is incorrect, because I pretty much replicated this, and every other implementation I found did exactly the same.

(Also, I am not sure this is necessarily the best website to post this kind of question on, but I decided to give it a shot.)

Schach21
  • Did you test it on other environments? Test it on some OpenAI gym environments and see if it's working. – Brale May 23 '20 at 11:05
  • I just tested it on cart pole. It's not quite working. So, the algorithm may not be correct. I am going to double check, but regardless, I'd like to know what people's experience is when working with these algorithms on their own environments. I've implemented SAC. It wasn't very difficult to make it work, especially because my environment is not very challenging. – Schach21 May 23 '20 at 14:25
  • Update to my last comment: it is working, the training is just very slow and the reward somewhat noisy, but overall, it works. I used Pendulum-v0. – Schach21 May 23 '20 at 14:45
  • Indeed, DDPG is extremely unstable. A huge improvement over DDPG is certainly D4PG. You should take a look at a recent paper by Gabriel Barth-Maron, Matthew W. Hoffman, and others, called Distributed Distributional Deterministic Policy Gradients, published in 2018: https://paperswithcode.com/search?q_meta=&q=Distributed+Distributional+Deterministic+Policy+Gradients – jgauth May 26 '20 at 21:49
  • Thanks for sharing! I'll certainly take a look at their paper. Also, for anyone reading this, please feel free to share improvements upon DDPG. – Schach21 May 27 '20 at 03:31

1 Answer


Below are some tweaks that helped me accelerate the training of DDPG on a Reacher-like environment:

  • Reducing the neural network size, compared to the original paper. Instead of:

2 hidden layers with 400 and 300 units respectively

I used 128 units for both hidden layers. I see in your implementation that you used 256; maybe you could try reducing this.
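
For concreteness, here is a minimal sketch of the smaller actor and critic I am describing, assuming PyTorch (only the 128-unit layer sizes matter here; everything else is illustrative):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        # Two hidden layers of 128 units instead of 400/300.
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions squashed to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar Q-value
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```

Note that this critic concatenates the state and action before the first layer; the original paper only feeds the action in at the second hidden layer, but this simpler variant also worked for me.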

  • As suggested in the paper, I added batch normalization:

... to manually scale the features so they are in similar ranges across environments and units. We address this issue by adapting a recent technique from deep learning called batch normalization (Ioffe & Szegedy, 2015). This technique normalizes each dimension across the samples in a minibatch to have unit mean and variance.

The implementation you used does not seem to include this.
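
As a rough sketch of what adding it might look like in PyTorch (the placement of the normalization layers here is one reasonable choice, not necessarily identical to the paper's):

```python
import torch.nn as nn

class ActorWithBatchNorm(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(state_dim),            # normalize the raw state features
            nn.Linear(state_dim, hidden),
            nn.BatchNorm1d(hidden), nn.ReLU(),    # normalize hidden activations
            nn.Linear(hidden, hidden),
            nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)
```

Keep in mind that `BatchNorm1d` behaves differently in training and evaluation modes: call `model.eval()` when selecting a single action (a batch of size 1 in training mode raises an error) and `model.train()` before gradient updates.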

  • Reducing the value of $\sigma$, which is a parameter of the Ornstein-Uhlenbeck process used to enable exploration. Originally it was $0.2$; I used $0.05$. (I can't find where this parameter is set in your code.)
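
A minimal sketch of the Ornstein-Uhlenbeck noise with the reduced $\sigma$ (the parameter names follow the usual DDPG convention; adapt them to your code):

```python
import numpy as np

class OUNoise:
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.05):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta
        self.sigma = sigma  # reduced from the 0.2 used in the original paper
        self.state = self.mu.copy()

    def reset(self):
        self.state = self.mu.copy()

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1), with unit time step
        dx = self.theta * (self.mu - self.state) \
             + self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state
```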

I am not entirely sure if this will help in your environment, but it was just to give you some ideas.

PS: Here is a link to the code I followed to build DDPG and here is a plot of rewards per episode.

user5093249
  • do the actor and critic have the same architecture in your case? – Schach21 May 24 '20 at 20:35
  • @Schach21 , yes, similar to the paper. I just reduced the number of neurons in the first hidden layer from 400 to 128, and in the second hidden layer from 300 to 128. – user5093249 May 24 '20 at 20:42
  • 1
    what did you reward curve look like? Did learning happen and the reward curve stay up (like in the paper)? or was it rather noisy? I included batch normalization. That helped significantly. But I am not getting a smooth learning curve. Moreover, I've never seen a smooth learning curve from DDPG besides the original paper. – Schach21 May 25 '20 at 17:57
  • 1
    ok, perhaps I should mention that in my case, the goal was to solve the environment based on a known benchmark mean reward. The environment is solved when the mean reward remains above a threshold for 100 consecutive episodes. When this objective has been reached, I stopped the learning process. I did not test whether the reward curve stayed up or not after that (in fact, it's possible that it doesn't). If I have the time to test this specific aspect, I'll update my answer to include the resuts. – user5093249 May 25 '20 at 22:20
  • 1
    @Schach21 , I updated the answer with a plot of rewards and the code I based my implementation on. The plot of rewards is shown for 890 episodes, where each episode lasts 1000 steps at most. As it can be seen, there is a number of drops of the reward below 30, which is the target for solving this environment. – user5093249 May 26 '20 at 13:43
  • thanks! My implementation is working as well. My reward curve looks very similar to yours. – Schach21 May 26 '20 at 16:57