Questions tagged [ddpg]

For questions related to the reinforcement learning algorithm called Deep Deterministic Policy Gradient (DDPG).

See https://spinningup.openai.com/en/latest/algorithms/ddpg.html for more info.

59 questions
14
votes
2 answers

How large should the replay buffer be?

I'm learning the DDPG algorithm by following this link: OpenAI Spinning Up document on DDPG, where it is written: In order for the algorithm to have stable behavior, the replay buffer should be large enough to contain a wide range of…
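A minimal sketch of such a fixed-capacity buffer (the class and method names are illustrative, not taken from the Spinning Up code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer; the oldest transitions are dropped when full."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniformly sample a minibatch of past transitions for the update step.
        return random.sample(self.buffer, batch_size)
```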
7
votes
0 answers

Is there a difference in the architecture of deep reinforcement learning when multiple actions are performed instead of a single action?

I've built a deep deterministic policy gradient reinforcement learning agent that can handle any game/task that has only one action. However, the agent seems to fail horribly when there are two or more actions. I tried to look online for…
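In most DDPG implementations, supporting several simultaneous actions only requires widening the actor's output layer to the action dimension; a rough PyTorch sketch (the layer sizes are arbitrary):

```python
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # one output per action component
        )
        self.max_action = max_action

    def forward(self, state):
        # tanh keeps each component in [-1, 1]; scale to the environment's range.
        return self.max_action * self.net(state)
```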
7
votes
2 answers

Why are reinforcement learning methods sample inefficient?

Reinforcement learning methods are considered to be extremely sample inefficient. For example, in a recent DeepMind paper by Hessel et al., they showed that in order to reach human-level performance on an Atari game running at 60 frames per second…
5
votes
1 answer

How does the Ornstein-Uhlenbeck process work, and how is it used in DDPG?

In section 3 of the paper Continuous control with deep reinforcement learning, the authors write As detailed in the supplementary materials we used an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930) to generate temporally correlated…
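A minimal sketch of Ornstein-Uhlenbeck exploration noise as it is typically added to DDPG actions (the parameter values are common defaults, not necessarily those of the paper):

```python
import numpy as np

class OUNoise:
    """x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.ones(action_dim) * mu

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        # Mean-reverting update: consecutive samples are correlated, which gives
        # smoother exploration than independent Gaussian noise at each step.
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(len(self.x))
        self.x = self.x + dx
        return self.x
```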
5
votes
1 answer

Why is Q2 a more or less independent estimate in Twin Delayed DDPG (TD3)?

Twin Delayed Deep Deterministic policy gradient (TD3) is inspired by both double Q-learning and double DQN. In double Q-learning, I understand that Q1 and Q2 are independent because they are trained on different samples. In double DQN, I understand…
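Whether the two critics are truly independent is the question; the mechanical part is only that the TD3 target uses the minimum of the two target critics, roughly as follows (function and argument names are illustrative):

```python
import torch

def td3_target(reward, next_state, done, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    # Target policy smoothing: add clipped noise to the target action.
    next_action = target_actor(next_state)
    noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
    next_action = (next_action + noise).clamp(-1.0, 1.0)

    # Clipped double-Q: take the smaller of the two critic estimates
    # to reduce overestimation bias.
    q_next = torch.min(target_q1(next_state, next_action),
                       target_q2(next_state, next_action))
    return reward + gamma * (1.0 - done) * q_next
```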
4
votes
1 answer

What made your DDPG implementation on your environment work?

I am working on a scheduling problem that has inherent randomness. The dimensions of the action and state spaces are 1 and 5 respectively. I am using DDPG, but it seems extremely unstable, and so far it isn't showing much learning. I've tried to adjust…
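One detail that often decides whether DDPG learns at all is the soft (Polyak) update of the target networks; a sketch, assuming PyTorch-style modules and an illustrative tau:

```python
def soft_update(target_net, online_net, tau=0.005):
    # Polyak averaging: target parameters slowly track the online network,
    # which tends to stabilize the bootstrapped critic targets.
    for t_param, param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.copy_(tau * param.data + (1.0 - tau) * t_param.data)
```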
4
votes
1 answer

How to avoid rapid actuator movements in favor of smooth movements in a continuous state and action space problem?

I'm working on a continuous state / continuous action controller. It shall control a certain roll angle of an aircraft by issuing the correct aileron commands (in $[-1, 1]$). To this end, I use a neural network and the DDPG algorithm, which shows…
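One common workaround, not taken from the question itself, is to shape the reward with a penalty on the change between consecutive actions; a hypothetical sketch:

```python
def shaped_reward(base_reward, action, prev_action, smooth_coef=0.1):
    # Hypothetical shaping term: penalize large changes between consecutive
    # aileron commands so the learned policy favors smooth control.
    rate_penalty = smooth_coef * abs(action - prev_action)
    return base_reward - rate_penalty
```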
4
votes
0 answers

What is the simplest policy gradient method to implement for a problem with a continuous action space?

I have a problem I would like to tackle with RL, but I am not sure if it is even doable. My agent has to figure out how to fill a very large vector (let's say from 600 to 4000 in the most complex setting) made of natural numbers, i.e. a 600 vector…
3
votes
0 answers

How to deal with a moving target in the Lunar Lander environment with DDPG?

I have noticed that DDPG does rather well at solving environments with a static target. For example, in the default Lunar Lander, the flags do not change position. So the DDPG model learns how to get to the center of the screen and land fairly…
3
votes
1 answer

Appropriate algorithm for RL problem with sparse rewards, continuous actions and significant stochasticity

I'm working on an RL problem with the following properties: The rewards are extremely sparse, i.e. all rewards are 0 except the terminal non-zero reward. Ideally I would not use any reward engineering, as that would lead to a different optimization…
3
votes
0 answers

How does adding noise to the action in DDPG help in learning?

I can't understand how perturbing the action generated by the actor network in DDPG with a noise term helps in exploration.
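Mechanically, exploration in DDPG amounts to taking the deterministic action plus noise, clipped to the valid range; a sketch (the noise_process object is assumed to expose a sample() method, e.g. Gaussian or Ornstein-Uhlenbeck noise):

```python
import numpy as np

def exploratory_action(actor, state, noise_process, low=-1.0, high=1.0):
    # The actor alone is deterministic: the same state always gives the same
    # action. Adding noise makes the behavior policy try nearby actions, and
    # the off-policy critic can still learn from those transitions.
    action = actor(state)
    return np.clip(action + noise_process.sample(), low, high)
```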
3
votes
1 answer

Purpose of using actor-critic algorithms under deterministic MDP dynamics?

One of the main disadvantages of the MC Policy Gradient algorithm (REINFORCE), as described, say, here, is the fact that it has high variance (the returns we sample will vary significantly from episode to episode). Therefore it is perfectly…
3
votes
1 answer

Training actor-critic algorithms in games with opponents

I am wondering how I am supposed to train a model using actor-critic algorithms in environments with opponents. I tried the following (using A3C and DDPG): play against a random player. I had rather good results, but not as good as expected, since…
3
votes
0 answers

Should noise (such as OU) be decreased over time in actor / critic algorithms?

In most of the RL algorithms I have seen, there is a coefficient that reduces action exploration over time, to help convergence. But in actor-critic or other algorithms (A3C, DDPG, ...) used in continuous action spaces, the different implementations I have seen…
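If one does decay the noise, a simple linear schedule on the noise scale is enough; an illustrative sketch (names and values are assumptions):

```python
def noise_scale(step, start=0.3, end=0.05, decay_steps=100_000):
    # Linear decay of the noise standard deviation from `start` to `end`;
    # many DDPG implementations instead keep the scale fixed throughout training.
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```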
3
votes
0 answers

Can I use deterministic policy gradient methods for stochastic policy learning?

Can I treat a stochastic policy (over a finite action space of size $n$) as a deterministic policy (in the set of probability distribution in $\mathbb{R}^n$)? It seems to me that nothing is broken by making this mental translation, except that the…
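The "mental translation" in the question can be made concrete by letting the deterministic actor output a point in the probability simplex (e.g. via a softmax over logits) and sampling the discrete action from it; a small illustrative sketch:

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def act(actor_logits, rng=np.random.default_rng()):
    # The actor's output is a deterministic "action" in R^n (logits); the
    # environment-facing discrete action is drawn from the induced distribution.
    probs = softmax(actor_logits)
    return rng.choice(len(probs), p=probs)
```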