Questions tagged [proximal-policy-optimization]
For questions related to the reinforcement learning algorithm Proximal Policy Optimization (PPO), introduced in the paper "Proximal Policy Optimization Algorithms" (2017) by John Schulman et al.
105 questions
17
votes
1 answer
How can policy gradients be applied in the case of multiple continuous actions?
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are two cutting-edge policy gradient algorithms.
When using a single continuous action, you would normally use some probability distribution (for example, Gaussian)…

Evalds Urtans
- 377
- 3
- 9
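A minimal sketch of the usual answer to the question above (PyTorch, with illustrative layer sizes and names not taken from the post): the policy network outputs one mean per action dimension, a state-independent log-std parameter provides the spread, and the joint log-probability is the sum of the per-dimension Gaussian log-probabilities.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy for a multi-dimensional continuous action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),              # one mean per action dimension
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std

    def forward(self, obs):
        mean = self.mean_net(obs)
        return Normal(mean, self.log_std.exp())      # independent Gaussian per dim

# Illustrative shapes: 8-dimensional observation, 3 continuous actions.
policy = GaussianPolicy(obs_dim=8, act_dim=3)
dist = policy(torch.randn(1, 8))
action = dist.sample()                               # shape (1, 3)
log_prob = dist.log_prob(action).sum(-1)             # joint log-prob: sum over dims
```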
16
votes
3 answers
How to implement a variable action space in Proximal Policy Optimization?
I'm coding a Proximal Policy Optimization (PPO) agent with the Tensorforce library (which is built on top of TensorFlow).
The first environment was very simple. Now, I'm diving into a more complex environment, where not all of the actions are available…

Max
- 163
- 1
- 6
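For the variable-action-space question above, a common workaround (sketched here in plain PyTorch rather than Tensorforce; the mask layout is illustrative) is to mask the logits of unavailable actions before constructing the categorical distribution, so they receive zero probability and are never sampled.

```python
import torch
from torch.distributions import Categorical

def masked_categorical(logits, action_mask):
    """Send logits of unavailable actions to -inf so their probability is zero."""
    masked_logits = logits.masked_fill(~action_mask, float("-inf"))
    return Categorical(logits=masked_logits)

# Illustrative example: 5 discrete actions, actions 1 and 3 currently unavailable.
logits = torch.randn(5)
mask = torch.tensor([True, False, True, False, True])
dist = masked_categorical(logits, mask)
action = dist.sample()             # never returns index 1 or 3
log_prob = dist.log_prob(action)   # stored for the PPO probability ratio
```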
7
votes
2 answers
Why is the log probability replaced with the importance sampling ratio in the loss function?
In the Trust Region Policy Optimization (TRPO) algorithm (and subsequently in PPO), I do not understand the motivation behind replacing the log-probability term from standard policy gradients
$$L^{PG}(\theta) = \hat{\mathbb{E}}_t[\log…

Mark
- 106
- 4
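For reference, the two objectives the question above contrasts, as written in the PPO paper: the standard policy-gradient objective with the log-probability term, and the conservative-policy-iteration surrogate that replaces it with the importance-sampling ratio $r_t(\theta)$:

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\big[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\big],
\qquad
L^{CPI}(\theta) = \hat{\mathbb{E}}_t\big[r_t(\theta)\,\hat{A}_t\big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.$$

At $\theta = \theta_{\text{old}}$ the two have the same gradient, since $\nabla_\theta r_t(\theta)\big|_{\theta_{\text{old}}} = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big|_{\theta_{\text{old}}}$; the ratio form additionally stays valid for samples drawn from the older policy.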
6
votes
1 answer
Understanding multi-iteration updates of the model in the Proximal Policy Optimization algorithm
I have a general question about the updating of the network/model in the PPO algorithm.
If I understand it correctly, multiple iterations of weight updates are performed on the model, using data collected from the environment (with the model…

Marcel_marcel1991
- 631
- 3
- 12
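A minimal sketch of the multi-epoch update the question above refers to (illustrative names; assumes a rollout of observations, actions, old log-probabilities and advantages has already been collected with the current policy, e.g. using the diagonal-Gaussian policy sketched earlier; value-function and entropy terms are omitted):

```python
import torch

def ppo_update(policy, optimizer, obs, actions, old_log_probs, advantages,
               clip_eps=0.2, epochs=10, minibatch_size=64):
    """Several optimization passes over the same on-policy rollout."""
    n = obs.shape[0]
    for _ in range(epochs):                                   # K epochs per rollout
        for idx in torch.randperm(n).split(minibatch_size):   # shuffled minibatches
            dist = policy(obs[idx])
            new_log_probs = dist.log_prob(actions[idx]).sum(-1)
            ratio = (new_log_probs - old_log_probs[idx]).exp()  # r_t(theta)
            adv = advantages[idx]
            clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
            loss = -torch.min(ratio * adv, clipped).mean()    # clipped surrogate
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```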
6
votes
1 answer
What is the effect of parallel environments in reinforcement learning?
Do parallel environments improve the agent's ability to learn, or does it not really make a difference? Specifically, I am using PPO, but I think this applies across the board to other algorithms too.

Dylan Kerler
- 243
- 2
- 7
6
votes
1 answer
How are continuous actions sampled (or generated) from the policy network in PPO?
I am trying to understand and reproduce the Proximal Policy Optimization (PPO) algorithm in detail. One thing that I find missing in the paper introducing the algorithm is how exactly actions $a_t$ are generated given the policy network…

Daniel B.
- 805
- 1
- 4
- 13
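A common reading of this for the continuous-action case, consistent with the Gaussian parameterization the paper describes for its continuous-control experiments: the network outputs a mean vector, separate variables hold the (log) standard deviations, and $a_t$ is a sample from the resulting diagonal Gaussian,

$$a_t \sim \mathcal{N}\!\big(\mu_\theta(s_t),\, \operatorname{diag}(\sigma^2)\big),
\qquad
\log \pi_\theta(a_t \mid s_t) = \sum_i \log \mathcal{N}\!\big(a_{t,i};\ \mu_{\theta,i}(s_t),\ \sigma_i^2\big),$$

with the sampled action typically clipped or squashed to the environment's action bounds before being executed.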
6
votes
2 answers
How is parallelism implemented in RL algorithms like PPO?
There are multiple ways to implement parallelism in reinforcement learning. One is to use parallel workers running in their own environments to collect data in parallel, instead of using replay memory buffers (this is how A3C works, for…

alex vdk
- 61
- 2
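A minimal sketch of the synchronous flavor described above (a hypothetical Gym-style `reset()`/`step()` interface is assumed, and names are illustrative): several environment copies are stepped in lockstep, and the batched transitions form one on-policy rollout for the next PPO update.

```python
import numpy as np

def collect_rollout(envs, act_fn, horizon):
    """Step a list of environment copies in lockstep for `horizon` timesteps."""
    obs = np.stack([env.reset() for env in envs])        # (num_envs, obs_dim)
    rollout = []
    for _ in range(horizon):
        actions = act_fn(obs)                            # one batched policy query
        steps = [env.step(a) for env, a in zip(envs, actions)]
        next_obs, rewards, dones = map(np.stack, zip(*[s[:3] for s in steps]))
        rollout.append((obs, actions, rewards, dones))
        obs = np.stack([env.reset() if d else o          # restart finished copies
                        for env, o, d in zip(envs, next_obs, dones)])
    return rollout                                       # consumed by the PPO update
```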
5
votes
2 answers
What are the best hyper-parameters to tune in reinforcement learning?
Obviously, this is somewhat subjective, but what hyper-parameters typically have the most significant impact on an RL agent's ability to learn? For example, the replay buffer size, learning rate, entropy coefficient, etc.
For example, in "normal"…

Dylan Kerler
- 243
- 2
- 7
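As one concrete reference point for the question above, the settings reported in the PPO paper for its continuous-control (Mujoco) experiments, which are also the knobs most implementations expose; treat them as starting points rather than tuned values:

```python
# Rough PPO defaults from the paper's Mujoco experiments; entropy and
# value-loss coefficients vary between implementations.
ppo_defaults = {
    "horizon": 2048,         # timesteps collected per policy update
    "learning_rate": 3e-4,   # Adam step size
    "num_epochs": 10,        # optimization passes over each rollout
    "minibatch_size": 64,
    "gamma": 0.99,           # discount factor
    "gae_lambda": 0.95,      # GAE parameter
    "clip_range": 0.2,       # surrogate clipping epsilon
}
```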
4
votes
2 answers
Why does the clipped surrogate objective work in Proximal Policy Optimization?
In Proximal Policy Optimization Algorithms (2017), Schulman et al. write
With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.
I don't…

16Aghnar
- 591
- 2
- 10
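For context, the clipped objective the quote refers to, with $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big].$$

Because the minimum of the clipped and unclipped terms is taken, the objective is a pessimistic lower bound on the unclipped surrogate, which is what the quoted sentence is describing.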
4
votes
1 answer
Mathematically, what is happening differently in the neural net during exploration vs. exploitation?
I want to understand roughly what is happening in the neural network of an RL agent when it is exploring vs. exploiting. For example, are the network weights not being updated when the agent is exploiting? Or somehow being updated to a lesser…

Vladimir Belik
- 342
- 2
- 12
4
votes
1 answer
Do we use validation and test sets for training a reinforcement learning agent?
I am pretty new to reinforcement learning and was working with some code for the PPO and DQN algorithms. After looking at the code, I noticed that the authors did not include any code to set up a validation or testing dataloader. In most other…

krishnab
- 197
- 7
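On the question above: there is usually no validation or test dataloader in RL code because there is no fixed dataset; the closest analogue is periodically rolling out the current policy on separate evaluation episodes, with no exploration noise and no weight updates. A minimal sketch, assuming a Gym-style environment and the diagonal-Gaussian policy sketched earlier (names illustrative):

```python
import torch

@torch.no_grad()
def evaluate(policy, env, episodes=10):
    """Average return over evaluation episodes using the deterministic (mean) action."""
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            dist = policy(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
            action = dist.mean.squeeze(0).numpy()      # no sampling at eval time
            obs, reward, done, _ = env.step(action)    # Gym-style 4-tuple assumed
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)
```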
4
votes
1 answer
What are the pros and cons of using standard deviation or entropy for exploration in PPO?
When trying to implement my own PPO (Proximal Policy Optimization) agent, I came across two different implementations:
Exploration with std
Collect trajectories over $N$ timesteps, using a policy-centered distribution with a progressively trained std…

Loheek
- 266
- 2
- 6
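For the entropy variant mentioned above, the usual form is to subtract an entropy bonus from the clipped surrogate loss (the paper's combined objective includes such a term), so the optimizer is penalized for collapsing the action distribution too early. A sketch with an illustrative coefficient, assuming the diagonal-Gaussian policy from the earlier sketches:

```python
import torch
from torch.distributions import Normal

def ppo_loss_with_entropy(policy_loss, dist, ent_coef=0.01):
    """Add an entropy bonus to the clipped surrogate loss (diagonal-Gaussian policy)."""
    entropy = dist.entropy().sum(-1).mean()     # joint entropy, averaged over the batch
    return policy_loss - ent_coef * entropy     # higher entropy lowers the loss

# Illustrative usage with a batch of 2-D Gaussian action distributions.
dist = Normal(torch.zeros(32, 2), torch.ones(32, 2))
loss = ppo_loss_with_entropy(policy_loss=torch.tensor(0.5), dist=dist)
```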
3
votes
3 answers
Why clip the PPO objective on only one side?
In PPO with the clipped surrogate objective (see the paper here), we have the following objective:
The shape of the function is shown in the image below, and depends on whether the advantage is positive or negative.
The min() operator makes…

Jer
- 31
- 2
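One way to see the asymmetry the question above points at is to expand the min over the two sign cases of the advantage:

$$\min\!\big(r\hat{A},\ \operatorname{clip}(r,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}\big) =
\begin{cases}
\min(r,\, 1+\epsilon)\,\hat{A} & \hat{A} \ge 0,\\
\max(r,\, 1-\epsilon)\,\hat{A} & \hat{A} < 0.
\end{cases}$$

In each case only the bound that would let the objective keep improving is active, while changes of $r$ that make the objective worse are left unclipped, so the clipping effectively acts on one side at a time.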
3
votes
1 answer
Why can the sum over timesteps in the Vanilla Policy Gradient be ignored?
I understand how to derive the vanilla policy gradient
$$
\begin{align}
\nabla_{\theta}J(\pi_{\theta}) = \mathbb{E}_{\pi_{\theta}} \left[ \sum_{t = 0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}) \hat{A}^{\pi_{\theta}}(s_{t}, a_{t})…

Peter
- 55
- 4
3
votes
1 answer
Does SAC perform better than PPO in sample-expensive tasks with discrete action spaces?
I am currently using Proximal Policy Optimization (PPO) to solve my RL task. However, after reading about Soft Actor-Critic (SAC) now I am unsure whether I should stick to PPO or switch to SAC. Moreover, from this post, it seems that much of the…

Aeryan
- 53
- 1
- 4