For questions about the Trust Region Policy Optimization (TRPO) algorithm, which was introduced in the paper "Trust Region Policy Optimization" (2015) by J. Schulman et al.
Questions tagged [trust-region-policy-optimization]
17 questions
17 votes · 1 answer
How can policy gradients be applied in the case of multiple continuous actions?
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are two state-of-the-art policy gradient algorithms.
When using a single continuous action, you would normally use some probability distribution (for example, a Gaussian)…

Evalds Urtans
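
A common way to handle several continuous actions (sketched here as general background, not as the accepted answer) is a diagonal Gaussian policy: the network outputs one mean per action dimension, a learnable log-standard-deviation vector sets the spread, and the joint log-probability is the sum over dimensions. A minimal sketch, assuming PyTorch; the layer sizes and names are illustrative:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class DiagonalGaussianPolicy(nn.Module):
    """Policy over several continuous actions, treating dimensions as independent."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )
        # One log-std per action dimension, independent of the state.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        dist = Normal(mean, self.log_std.exp())
        action = dist.sample()
        # Joint log-prob of the action vector = sum of per-dimension log-probs.
        log_prob = dist.log_prob(action).sum(dim=-1)
        return action, log_prob

policy = DiagonalGaussianPolicy(obs_dim=8, act_dim=3)
action, log_prob = policy(torch.randn(1, 8))
```
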
7 votes · 2 answers
Why is the log probability replaced with the importance sampling in the loss function?
In the Trust Region Policy Optimization (TRPO) algorithm (and subsequently in PPO as well), I do not understand the motivation for replacing the log-probability term of standard policy gradients
$$L^{PG}(\theta) = \hat{\mathbb{E}}_t[\log…

Mark
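
For reference, the contrast the question is about can be written side by side (the $L^{IS}$ label is just local notation for this comparison): the standard policy-gradient objective weights the advantage by a log-probability, while the TRPO/PPO surrogate weights it by an importance ratio against the data-collecting policy,

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t|s_t)\,\hat{A}_t\right], \qquad L^{IS}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\,\hat{A}_t\right],$$

and since $\nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta / \pi_\theta$, both objectives have the same gradient when evaluated at $\theta = \theta_\text{old}$; the ratio form additionally remains meaningful when the samples come from the older policy.
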
4 votes · 1 answer
What is the difference between an on-policy distribution and state visitation frequency?
On-policy distribution is defined as follows in Sutton and Barto:
On the other hand, state visitation frequency is defined as follows in Trust Region Policy Optimization:
$$\rho_{\pi}(s) = \sum_{t=0}^{T} \gamma^t P(s_t=s|\pi)$$
Question: What is…

user529295
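
One concrete difference, stated here as a clarifying note for the infinite-horizon version rather than as either source's exact wording: the TRPO quantity is unnormalized, since

$$\sum_s \rho_{\pi}(s) = \sum_{t=0}^{\infty}\gamma^t \sum_s P(s_t=s|\pi) = \frac{1}{1-\gamma},$$

so $(1-\gamma)\,\rho_{\pi}(s)$ is a proper discounted state distribution, whereas the on-policy distribution in Sutton and Barto is defined to already be normalized (the fraction of time spent in $s$ under $\pi$).
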
3 votes · 1 answer
Maximizing or Minimizing in Trust Region Policy Optimization?
I happened to discover that the v1 (19 Feb 2015) and v5 (20 Apr 2017) versions of the TRPO paper have two different conclusions: Equation (15) in v1 is $\min_\theta$, while Equation (14) in v5 is $\max_\theta$. So I'm a little bit confused…

fish_tree
3 votes · 1 answer
Is (log-)standard deviation learned in TRPO and PPO or fixed instead?
After reading Williams (1992), which suggested that both the mean and the standard deviation can be learned while training a REINFORCE algorithm to generate continuous output values, I assumed that this would be common practice…

Daniel B.
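
A pattern that appears in many public TRPO/PPO implementations (a common convention, not a claim about any specific codebase): the log-standard-deviation is a free, state-independent parameter trained jointly with the mean network, rather than a fixed constant. A minimal sketch assuming PyTorch, with illustrative shapes:

```python
import torch
import torch.nn as nn

act_dim = 3
mean_net = nn.Linear(10, act_dim)                      # stand-in for the policy's mean network
log_std = nn.Parameter(torch.full((act_dim,), -0.5))   # learnable, state-independent

# Because log_std is an nn.Parameter handed to the optimizer, it receives
# gradients through the action log-probabilities and is updated like any weight.
optimizer = torch.optim.Adam(list(mean_net.parameters()) + [log_std], lr=3e-4)

# The "fixed" alternative is simply a constant tensor kept out of the optimizer:
fixed_log_std = torch.full((act_dim,), -0.5)           # never updated during training
```
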
3 votes · 1 answer
In lemma 1 of the TRPO paper, why isn't the expectation over $s'∼P(s'|s,a)$?
In the Trust Region Policy Optimization paper, in Lemma 1 of Appendix A, I didn't quite understand the transition from (20) to (21). In going from (20) to (21), $A^\pi(s_t, a_t)$ is substituted with its value. The value of $A^\pi(s_t, a_t)$ is…

A Das
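
As a reasoning aid (my paraphrase of the step in question, not the paper's exact text): the advantage is first expanded using the transition model,

$$A^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}\sim P(\cdot|s_t,a_t)}\left[r(s_t) + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\right],$$

and when this is plugged into an expectation over whole trajectories $\tau \sim \tilde{\pi}$, the inner expectation over $s_{t+1}$ is absorbed, because a trajectory already samples $s_{t+1}$ from $P(\cdot|s_t, a_t)$; the sum then telescopes,

$$\mathbb{E}_{\tau\sim\tilde{\pi}}\left[\sum_{t=0}^{\infty}\gamma^t A^\pi(s_t,a_t)\right] = \mathbb{E}_{\tau\sim\tilde{\pi}}\left[-V^\pi(s_0) + \sum_{t=0}^{\infty}\gamma^t r(s_t)\right] = \eta(\tilde{\pi}) - \eta(\pi).$$
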
3 votes · 1 answer
Are these two TRPO objective functions equivalent?
In the TRPO paper, the objective to maximize is (equation 14)
$$
\mathbb{E}_{s\sim\rho_{\theta_\text{old}},a\sim q}\left[\frac{\pi_\theta(a|s)}{q(a|s)} Q_{\theta_\text{old}}(s,a) \right]
$$
which involves an expectation over states sampled with some…

udscbt
2 votes · 1 answer
How is inequality 31 derived from equality 30 in lemma 2 of the "Trust Region Policy Optimization" paper?
In the Trust Region Policy Optimization paper, in Lemma 2 of Appendix A (p. 11), I didn't quite understand how inequality (31) is derived from equality (30), which is:
$$\bar{A}(s) = P(a \neq \tilde{a} | s) \mathbb{E}_{(a, \tilde{a}) \sim (\pi,…

Afshin Oroojlooy
2 votes · 1 answer
What makes TRPO an actor-critic method? Where is the critic?
From what I understand, Trust Region Policy Optimization (TRPO) is a modification on Natural Policy Gradient (NPG) that derives the optimal step size $\beta$ from a KL constraint between the new and old policy.
NPG is a modification to "vanilla"…

thesofakillers
2 votes · 1 answer
How can I implement the reward function for an 8-DOF robot arm with TRPO?
I need to get an 8-DOF (degrees of freedom) robot arm to move to a specified point. I need to implement the TRPO RL code using OpenAI Gym. I already have the Gazebo environment, but I am unsure of how to write the code for the reward function and the…

user1690356
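
Without claiming this matches the asker's exact setup, a common starting point for reaching tasks is a dense reward built from the distance between the end-effector and the target, plus a bonus when the target is reached. All names below are hypothetical placeholders for values the Gazebo/Gym environment would provide:

```python
import numpy as np

def reaching_reward(end_effector_pos, target_pos, reach_threshold=0.05):
    """Dense reward for a reaching task: closer is better, with a bonus on success."""
    dist = np.linalg.norm(np.asarray(end_effector_pos) - np.asarray(target_pos))
    reward = -dist                 # dense shaping term: negative Euclidean distance
    done = bool(dist < reach_threshold)
    if done:
        reward += 10.0             # sparse bonus once the target is reached
    return reward, done

# Example call with made-up 3D positions:
r, done = reaching_reward([0.10, 0.20, 0.30], [0.10, 0.25, 0.30])
```
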
2 votes · 0 answers
How does the TRPO surrogate loss account for the error in the policy?
In the Trust Region Policy Optimization (TRPO) paper, on page 10, it is stated
An informal overview is as follows. Our proof relies on the notion of coupling, where we jointly define the policies $\pi$ and $\pi'$ so that they choose the …

olliejday
1 vote · 1 answer
Does importance sampling really improve sampling efficiency of TRPO or PPO?
Vanilla policy gradient has a loss function:
$$\mathcal{L}_{\pi_{\theta}}(\theta) = E_{\tau \sim \pi_{\theta}}\left[\sum\limits_{t = 0}^{\infty}\gamma^{t}r_{t}\right]$$
while in TRPO it is:
$$\mathcal{L}_{\pi_{\theta_{old}}}(\theta) = \frac{1}{1 - \gamma}E_{s,…

Magi Feeney
1 vote · 1 answer
Why does each component of the tuple that represents an action have a categorical distribution in the TRPO paper?
I was going through the TRPO paper, and there is a line in the last paragraph of Appendix D, "Approximating Factored Policies with Neural Networks", which I am unable to understand:
The action consists of a tuple $(a_1, a_2, \ldots, a_K)$ of…

srij
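
The paragraph being asked about describes a factored discrete policy: the network emits one set of logits per action component, each component is sampled from its own categorical distribution, and the joint probability is the product over components (so the joint log-probability is a sum). A minimal sketch assuming PyTorch, with illustrative sizes:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class FactoredCategoricalPolicy(nn.Module):
    """Action is a tuple (a_1, ..., a_K); each a_k has its own categorical head."""
    def __init__(self, obs_dim, n_choices_per_component):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.heads = nn.ModuleList(
            [nn.Linear(64, n) for n in n_choices_per_component]
        )

    def forward(self, obs):
        h = self.body(obs)
        dists = [Categorical(logits=head(h)) for head in self.heads]
        actions = [d.sample() for d in dists]
        # Components are treated as independent given the state,
        # so the joint log-prob is the sum over the K heads.
        log_prob = sum(d.log_prob(a) for d, a in zip(dists, actions))
        return torch.stack(actions, dim=-1), log_prob

policy = FactoredCategoricalPolicy(obs_dim=8, n_choices_per_component=[3, 3, 5])
action, log_prob = policy(torch.randn(1, 8))
```
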
1 vote · 0 answers
Why does PPO lead to a worse performance than TRPO in the same task?
So far I have been training an agent with an actor-critic network and updating it with TRPO. Now I tried out PPO, and the results are drastically different and much worse. I only changed from TRPO to PPO; the rest of the environment and the rewards are the same. PPO is…

thsolyt
0 votes · 0 answers
Very high dimensional optimization with large budget, requiring high quality solutions
What would theoretically be the best-performing optimization algorithm(s) in this case?
Very high dimensional problem: 250-500 parameters
Goal is to obtain very high quality solutions, not just "good" solutions
Parameters form multiple…

Charly Empereur-mot