Questions tagged [proximal-policy-optimization]
For questions related to the reinforcement learning algorithm Proximal Policy Optimization (PPO), introduced in the paper "Proximal Policy Optimization Algorithms" (2017) by John Schulman et al.
105 questions
17
votes
1 answer
How can policy gradients be applied in the case of multiple continuous actions?
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are two cutting-edge policy gradient algorithms.
When using a single continuous action, you would normally use some probability distribution (for example, Gaussian)…

Evalds Urtans
- 377
- 3
- 9
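A minimal sketch of the usual answer to the question above (PyTorch, with illustrative layer sizes and names not taken from the post): the policy network outputs one mean per action dimension, a state-independent log-std parameter provides the spread, and the joint log-probability is the sum of the per-dimension Gaussian log-probabilities.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy for a multi-dimensional continuous action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),              # one mean per action dimension
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std

    def forward(self, obs):
        mean = self.mean_net(obs)
        return Normal(mean, self.log_std.exp())      # independent Gaussian per dim

# Illustrative shapes: 8-dimensional observation, 3 continuous actions.
policy = GaussianPolicy(obs_dim=8, act_dim=3)
dist = policy(torch.randn(1, 8))
action = dist.sample()                               # shape (1, 3)
log_prob = dist.log_prob(action).sum(-1)             # joint log-prob: sum over dims
```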
16
votes
3 answers
How to implement a variable action space in Proximal Policy Optimization?
I'm coding a Proximal Policy Optimization (PPO) agent with the Tensorforce library (which is built on top of TensorFlow).
The first environment was very simple. Now, I'm diving into a more complex environment, where not all of the actions are available…

Max
- 163
- 1
- 6
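For the variable-action-space question above, a common workaround (sketched here in plain PyTorch rather than Tensorforce; the mask layout is illustrative) is to mask the logits of unavailable actions before constructing the categorical distribution, so they receive zero probability and are never sampled.

```python
import torch
from torch.distributions import Categorical

def masked_categorical(logits, action_mask):
    """Send logits of unavailable actions to -inf so their probability is zero."""
    masked_logits = logits.masked_fill(~action_mask, float("-inf"))
    return Categorical(logits=masked_logits)

# Illustrative example: 5 discrete actions, actions 1 and 3 currently unavailable.
logits = torch.randn(5)
mask = torch.tensor([True, False, True, False, True])
dist = masked_categorical(logits, mask)
action = dist.sample()             # never returns index 1 or 3
log_prob = dist.log_prob(action)   # stored for the PPO probability ratio
```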
7
votes
2 answers
Why is the log probability replaced with the importance sampling ratio in the loss function?
In the Trust Region Policy Optimization (TRPO) algorithm (and subsequently in PPO), I do not understand the motivation behind replacing the log-probability term from standard policy gradients
$$L^{PG}(\theta) = \hat{\mathbb{E}}_t[\log…

Mark
- 106
- 4
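For reference, the two objectives the question above contrasts, as written in the PPO paper: the standard policy-gradient objective with the log-probability term, and the conservative-policy-iteration surrogate that replaces it with the importance-sampling ratio $r_t(\theta)$:

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\big[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\big],
\qquad
L^{CPI}(\theta) = \hat{\mathbb{E}}_t\big[r_t(\theta)\,\hat{A}_t\big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.$$

At $\theta = \theta_{\text{old}}$ the two have the same gradient, since $\nabla_\theta r_t(\theta)\big|_{\theta_{\text{old}}} = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big|_{\theta_{\text{old}}}$; the ratio form additionally stays valid for samples drawn from the older policy.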
6
votes
1 answer
Understanding multi-iteration updates of the model in the Proximal Policy Optimization algorithm
I have a general question about the updating of the network/model in the PPO algorithm.
If I understand it correctly, multiple iterations of weight updates are performed on the model, using data collected from the environment (with the model…

Marcel_marcel1991
- 631
- 3
- 12
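A minimal sketch of the multi-epoch update the question above refers to (illustrative names; assumes a rollout of observations, actions, old log-probabilities and advantages has already been collected with the current policy, e.g. using the diagonal-Gaussian policy sketched earlier; value-function and entropy terms are omitted):

```python
import torch

def ppo_update(policy, optimizer, obs, actions, old_log_probs, advantages,
               clip_eps=0.2, epochs=10, minibatch_size=64):
    """Several optimization passes over the same on-policy rollout."""
    n = obs.shape[0]
    for _ in range(epochs):                                   # K epochs per rollout
        for idx in torch.randperm(n).split(minibatch_size):   # shuffled minibatches
            dist = policy(obs[idx])
            new_log_probs = dist.log_prob(actions[idx]).sum(-1)
            ratio = (new_log_probs - old_log_probs[idx]).exp()  # r_t(theta)
            adv = advantages[idx]
            clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
            loss = -torch.min(ratio * adv, clipped).mean()    # clipped surrogate
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```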
6
votes
1 answer
What is the effect of parallel environments in reinforcement learning?
Do parallel environments improve the agent's ability to learn, or does it not really make a difference? Specifically, I am using PPO, but I think this applies across the board to other algorithms too.

Dylan Kerler
- 243
- 2
- 7
6
votes
1 answer
How are continuous actions sampled (or generated) from the policy network in PPO?
I am trying to understand and reproduce the Proximal Policy Optimization (PPO) algorithm in detail. One thing that I find missing in the paper introducing the algorithm is how exactly actions $a_t$ are generated given the policy network…

Daniel B.
- 805
- 1
- 4
- 13
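A common reading of this for the continuous-action case, consistent with the Gaussian parameterization the paper describes for its continuous-control experiments: the network outputs a mean vector, separate variables hold the (log) standard deviations, and $a_t$ is a sample from the resulting diagonal Gaussian,

$$a_t \sim \mathcal{N}\!\big(\mu_\theta(s_t),\, \operatorname{diag}(\sigma^2)\big),
\qquad
\log \pi_\theta(a_t \mid s_t) = \sum_i \log \mathcal{N}\!\big(a_{t,i};\ \mu_{\theta,i}(s_t),\ \sigma_i^2\big),$$

with the sampled action typically clipped or squashed to the environment's action bounds before being executed.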
6
votes
2 answers
How is parallelism implemented in RL algorithms like PPO?
There are multiple ways to implement parallelism in reinforcement learning. One is to use parallel workers running in their own environments to collect data in parallel, instead of using replay memory buffers (this is how A3C works, for…

alex vdk
- 61
- 2
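A minimal sketch of the synchronous flavor described above (a hypothetical Gym-style `reset()`/`step()` interface is assumed, and names are illustrative): several environment copies are stepped in lockstep, and the batched transitions form one on-policy rollout for the next PPO update.

```python
import numpy as np

def collect_rollout(envs, act_fn, horizon):
    """Step a list of environment copies in lockstep for `horizon` timesteps."""
    obs = np.stack([env.reset() for env in envs])        # (num_envs, obs_dim)
    rollout = []
    for _ in range(horizon):
        actions = act_fn(obs)                            # one batched policy query
        steps = [env.step(a) for env, a in zip(envs, actions)]
        next_obs, rewards, dones = map(np.stack, zip(*[s[:3] for s in steps]))
        rollout.append((obs, actions, rewards, dones))
        obs = np.stack([env.reset() if d else o          # restart finished copies
                        for env, o, d in zip(envs, next_obs, dones)])
    return rollout                                       # consumed by the PPO update
```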
5
votes
2 answers
What are the best hyper-parameters to tune in reinforcement learning?
Obviously, this is somewhat subjective, but what hyper-parameters typically have the most significant impact on an RL agent's ability to learn? For example, the replay buffer size, learning rate, entropy coefficient, etc.
For example, in "normal"…

Dylan Kerler
- 243
- 2
- 7
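As one concrete reference point for the question above, the settings reported in the PPO paper for its continuous-control (Mujoco) experiments, which are also the knobs most implementations expose; treat them as starting points rather than tuned values:

```python
# Rough PPO defaults from the paper's Mujoco experiments; entropy and
# value-loss coefficients vary between implementations.
ppo_defaults = {
    "horizon": 2048,         # timesteps collected per policy update
    "learning_rate": 3e-4,   # Adam step size
    "num_epochs": 10,        # optimization passes over each rollout
    "minibatch_size": 64,
    "gamma": 0.99,           # discount factor
    "gae_lambda": 0.95,      # GAE parameter
    "clip_range": 0.2,       # surrogate clipping epsilon
}
```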
4
votes
2 answers
Why does the clipped surrogate objective work in Proximal Policy Optimization?
In Proximal Policy Optimization Algorithms (2017), Schulman et al. write
With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.
I don't…

16Aghnar
- 591
- 2
- 10
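For context, the clipped objective the quote refers to, with $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big].$$

Because the minimum of the clipped and unclipped terms is taken, the objective is a pessimistic lower bound on the unclipped surrogate, which is what the quoted sentence is describing.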
4
votes
1 answer
Mathematically, what is happening differently in the neural net during exploration vs. exploitation?
I want to understand roughly what is happening in the neural network of an RL agent when it is exploring vs. exploiting. For example, are the network weights not being updated when the agent is exploiting? Or somehow being updated to a lesser…

Vladimir Belik
- 342
- 2
- 12
4
votes
1 answer
Do we use validation and test sets for training a reinforcement learning agent?
I am pretty new to reinforcement learning and was working with some code for the PPO and DQN algorithms. After looking at the code, I noticed that the authors did not include any code to set up a validation or testing dataloader. In most other…

krishnab
- 197
- 7
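On the question above: there is usually no validation or test dataloader in RL code because there is no fixed dataset; the closest analogue is periodically rolling out the current policy on separate evaluation episodes, with no exploration noise and no weight updates. A minimal sketch, assuming a Gym-style environment and the diagonal-Gaussian policy sketched earlier (names illustrative):

```python
import torch

@torch.no_grad()
def evaluate(policy, env, episodes=10):
    """Average return over evaluation episodes using the deterministic (mean) action."""
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            dist = policy(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
            action = dist.mean.squeeze(0).numpy()      # no sampling at eval time
            obs, reward, done, _ = env.step(action)    # Gym-style 4-tuple assumed
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)
```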
4
votes
1 answer
What are the pros and cons of using standard deviation or entropy for exploration in PPO?
When trying to implement my own PPO (Proximal Policy Optimization) agent, I came across two different implementations:
Exploration with std
Collect trajectories over $N$ timesteps, using a policy-centered distribution with a progressively trained std…

Loheek
- 266
- 2
- 6
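For the entropy variant mentioned above, the usual form is to subtract an entropy bonus from the clipped surrogate loss (the paper's combined objective includes such a term), so the optimizer is penalized for collapsing the action distribution too early. A sketch with an illustrative coefficient, assuming the diagonal-Gaussian policy from the earlier sketches:

```python
import torch
from torch.distributions import Normal

def ppo_loss_with_entropy(policy_loss, dist, ent_coef=0.01):
    """Add an entropy bonus to the clipped surrogate loss (diagonal-Gaussian policy)."""
    entropy = dist.entropy().sum(-1).mean()     # joint entropy, averaged over the batch
    return policy_loss - ent_coef * entropy     # higher entropy lowers the loss

# Illustrative usage with a batch of 2-D Gaussian action distributions.
dist = Normal(torch.zeros(32, 2), torch.ones(32, 2))
loss = ppo_loss_with_entropy(policy_loss=torch.tensor(0.5), dist=dist)
```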
3
votes
3 answers
Why clip the PPO objective on only one side?
In PPO with the clipped surrogate objective (see the paper here), we have the following objective:
The shape of the function is shown in the image below, and depends on whether the advantage is positive or negative.
The min() operator makes…

Jer
- 31
- 2
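One way to see the asymmetry the question above points at is to expand the min over the two sign cases of the advantage:

$$\min\!\big(r\hat{A},\ \operatorname{clip}(r,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}\big) =
\begin{cases}
\min(r,\, 1+\epsilon)\,\hat{A} & \hat{A} \ge 0,\\
\max(r,\, 1-\epsilon)\,\hat{A} & \hat{A} < 0.
\end{cases}$$

In each case only the bound that would let the objective keep improving is active, while changes of $r$ that make the objective worse are left unclipped, so the clipping effectively acts on one side at a time.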
3
votes
1 answer
Why can the sum over timesteps in the Vanilla Policy Gradient be ignored?
I understand how to derive the vanilla policy gradient
$$
\begin{align}
\nabla_{\theta}J(\pi_{\theta}) = \mathbb{E}_{\pi_{\theta}} \left[ \sum_{t = 0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}) \hat{A}^{\pi_{\theta}}(s_{t}, a_{t})…

Peter
- 55
- 4
3
votes
1 answer
Does SAC perform better than PPO in sample-expensive tasks with discrete action spaces?
I am currently using Proximal Policy Optimization (PPO) to solve my RL task. However, after reading about Soft Actor-Critic (SAC) now I am unsure whether I should stick to PPO or switch to SAC. Moreover, from this post, it seems that much of the…

Aeryan
- 53
- 1
- 4