For questions about the Trust Region Policy Optimization (TRPO) algorithm, which was introduced in the paper "Trust Region Policy Optimization" (2015) by J. Schulman et al.
Questions tagged [trust-region-policy-optimization]
17 questions
17 votes · 1 answer
How can policy gradients be applied in the case of multiple continuous actions?
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are two state-of-the-art policy gradient algorithms.
When using a single continuous action, you would normally use some probability distribution (for example, a Gaussian)…

Evalds Urtans
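
A common way to handle several continuous actions (sketched here as general background, not as the accepted answer) is a diagonal Gaussian policy: the network outputs one mean per action dimension, a learnable log-standard-deviation vector sets the spread, and the joint log-probability is the sum over dimensions. A minimal sketch, assuming PyTorch; the layer sizes and names are illustrative:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class DiagonalGaussianPolicy(nn.Module):
    """Policy over several continuous actions, treating dimensions as independent."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )
        # One log-std per action dimension, independent of the state.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        dist = Normal(mean, self.log_std.exp())
        action = dist.sample()
        # Joint log-prob of the action vector = sum of per-dimension log-probs.
        log_prob = dist.log_prob(action).sum(dim=-1)
        return action, log_prob

policy = DiagonalGaussianPolicy(obs_dim=8, act_dim=3)
action, log_prob = policy(torch.randn(1, 8))
```
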
7 votes · 2 answers
Why is the log probability replaced with the importance sampling in the loss function?
In the Trust Region Policy Optimization (TRPO) algorithm (and subsequently in PPO as well), I do not understand the motivation for replacing the log-probability term of standard policy gradients
$$L^{PG}(\theta) = \hat{\mathbb{E}}_t[\log…

Mark
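
For reference, the contrast the question is about can be written side by side (the $L^{IS}$ label is just local notation for this comparison): the standard policy-gradient objective weights the advantage by a log-probability, while the TRPO/PPO surrogate weights it by an importance ratio against the data-collecting policy,

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t|s_t)\,\hat{A}_t\right], \qquad L^{IS}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\,\hat{A}_t\right],$$

and since $\nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta / \pi_\theta$, both objectives have the same gradient when evaluated at $\theta = \theta_\text{old}$; the ratio form additionally remains meaningful when the samples come from the older policy.
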
4 votes · 1 answer
What is the difference between an on-policy distribution and state visitation frequency?
On-policy distribution is defined as follows in Sutton and Barto:
On the other hand, state visitation frequency is defined as follows in Trust Region Policy Optimization:
$$\rho_{\pi}(s) = \sum_{t=0}^{T} \gamma^t P(s_t=s|\pi)$$
Question: What is…

user529295
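
One concrete difference, stated here as a clarifying note for the infinite-horizon version rather than as either source's exact wording: the TRPO quantity is unnormalized, since

$$\sum_s \rho_{\pi}(s) = \sum_{t=0}^{\infty}\gamma^t \sum_s P(s_t=s|\pi) = \frac{1}{1-\gamma},$$

so $(1-\gamma)\,\rho_{\pi}(s)$ is a proper discounted state distribution, whereas the on-policy distribution in Sutton and Barto is defined to already be normalized (the fraction of time spent in $s$ under $\pi$).
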
3 votes · 1 answer
Maximizing or Minimizing in Trust Region Policy Optimization?
I happened to discover that the v1 (19 Feb 2015) and v5 (20 Apr 2017) versions of the TRPO paper have two different conclusions: Equation (15) in v1 is $\min_\theta$, while Equation (14) in v5 is $\max_\theta$. So I'm a little bit confused…

fish_tree
3 votes · 1 answer
Is (log-)standard deviation learned in TRPO and PPO or fixed instead?
After reading Williams (1992), which suggested that both the mean and the standard deviation can be learned while training a REINFORCE algorithm to generate continuous output values, I assumed that this would be common practice…

Daniel B.
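
A pattern that appears in many public TRPO/PPO implementations (a common convention, not a claim about any specific codebase): the log-standard-deviation is a free, state-independent parameter trained jointly with the mean network, rather than a fixed constant. A minimal sketch assuming PyTorch, with illustrative shapes:

```python
import torch
import torch.nn as nn

act_dim = 3
mean_net = nn.Linear(10, act_dim)                      # stand-in for the policy's mean network
log_std = nn.Parameter(torch.full((act_dim,), -0.5))   # learnable, state-independent

# Because log_std is an nn.Parameter handed to the optimizer, it receives
# gradients through the action log-probabilities and is updated like any weight.
optimizer = torch.optim.Adam(list(mean_net.parameters()) + [log_std], lr=3e-4)

# The "fixed" alternative is simply a constant tensor kept out of the optimizer:
fixed_log_std = torch.full((act_dim,), -0.5)           # never updated during training
```
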
3 votes · 1 answer
In lemma 1 of the TRPO paper, why isn't the expectation over $s'∼P(s'|s,a)$?
In the Trust Region Policy Optimization paper, in Lemma 1 of Appendix A, I didn't quite understand the transition from (20) to (21). In going from (20) to (21), $A^\pi(s_t, a_t)$ is substituted with its value. The value of $A^\pi(s_t, a_t)$ is…

A Das
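
As a reasoning aid (my paraphrase of the step in question, not the paper's exact text): the advantage is first expanded using the transition model,

$$A^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}\sim P(\cdot|s_t,a_t)}\left[r(s_t) + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\right],$$

and when this is plugged into an expectation over whole trajectories $\tau \sim \tilde{\pi}$, the inner expectation over $s_{t+1}$ is absorbed, because a trajectory already samples $s_{t+1}$ from $P(\cdot|s_t, a_t)$; the sum then telescopes,

$$\mathbb{E}_{\tau\sim\tilde{\pi}}\left[\sum_{t=0}^{\infty}\gamma^t A^\pi(s_t,a_t)\right] = \mathbb{E}_{\tau\sim\tilde{\pi}}\left[-V^\pi(s_0) + \sum_{t=0}^{\infty}\gamma^t r(s_t)\right] = \eta(\tilde{\pi}) - \eta(\pi).$$
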
3 votes · 1 answer
Are these two TRPO objective functions equivalent?
In the TRPO paper, the objective to maximize is (equation 14)
$$
\mathbb{E}_{s\sim\rho_{\theta_\text{old}},a\sim q}\left[\frac{\pi_\theta(a|s)}{q(a|s)} Q_{\theta_\text{old}}(s,a) \right]
$$
which involves an expectation over states sampled with some…

udscbt
2 votes · 1 answer
How is inequality 31 derived from equality 30 in lemma 2 of the "Trust Region Policy Optimization" paper?
In the Trust Region Policy Optimization paper, in Lemma 2 of Appendix A (p. 11), I didn't quite understand how inequality (31) is derived from equality (30), which is:
$$\bar{A}(s) = P(a \neq \tilde{a} | s) \mathbb{E}_{(a, \tilde{a}) \sim (\pi,…

Afshin Oroojlooy
2 votes · 1 answer
What makes TRPO an actor-critic method? Where is the critic?
From what I understand, Trust Region Policy Optimization (TRPO) is a modification on Natural Policy Gradient (NPG) that derives the optimal step size $\beta$ from a KL constraint between the new and old policy.
NPG is a modification to "vanilla"…

thesofakillers
2 votes · 1 answer
How can I implement the reward function for an 8-DOF robot arm with TRPO?
I need to get an 8-DOF (degrees of freedom) robot arm to move to a specified point. I need to implement the TRPO RL code using OpenAI Gym. I already have the Gazebo environment, but I am unsure of how to write the code for the reward function and the…

user1690356
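
Without claiming this matches the asker's exact setup, a common starting point for reaching tasks is a dense reward built from the distance between the end-effector and the target, plus a bonus when the target is reached. All names below are hypothetical placeholders for values the Gazebo/Gym environment would provide:

```python
import numpy as np

def reaching_reward(end_effector_pos, target_pos, reach_threshold=0.05):
    """Dense reward for a reaching task: closer is better, with a bonus on success."""
    dist = np.linalg.norm(np.asarray(end_effector_pos) - np.asarray(target_pos))
    reward = -dist                 # dense shaping term: negative Euclidean distance
    done = bool(dist < reach_threshold)
    if done:
        reward += 10.0             # sparse bonus once the target is reached
    return reward, done

# Example call with made-up 3D positions:
r, done = reaching_reward([0.10, 0.20, 0.30], [0.10, 0.25, 0.30])
```
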
2 votes · 0 answers
How does the TRPO surrogate loss account for the error in the policy?
In the Trust Region Policy Optimization (TRPO) paper, on page 10, it is stated
An informal overview is as follows. Our proof relies on the notion of coupling, where we jointly define the policies $\pi$ and $\pi'$ so that they choose the …

olliejday
1 vote · 1 answer
Does importance sampling really improve sampling efficiency of TRPO or PPO?
Vanilla policy gradient has a loss function:
$$\mathcal{L}_{\pi_{\theta}}(\theta) = E_{\tau \sim \pi_{\theta}}\left[\sum\limits_{t = 0}^{\infty}\gamma^{t}r_{t}\right]$$
while in TRPO it is:
$$\mathcal{L}_{\pi_{\theta_{old}}}(\theta) = \frac{1}{1 - \gamma}E_{s,…

Magi Feeney
1 vote · 1 answer
Why does each component of the tuple that represents an action have a categorical distribution in the TRPO paper?
I was going through the TRPO paper, and there is a line in the last paragraph of Appendix D, "Approximating Factored Policies with Neural Networks", which I am unable to understand:
The action consists of a tuple $(a_1, a_2, \ldots, a_K)$ of…

srij
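
The paragraph being asked about describes a factored discrete policy: the network emits one set of logits per action component, each component is sampled from its own categorical distribution, and the joint probability is the product over components (so the joint log-probability is a sum). A minimal sketch assuming PyTorch, with illustrative sizes:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class FactoredCategoricalPolicy(nn.Module):
    """Action is a tuple (a_1, ..., a_K); each a_k has its own categorical head."""
    def __init__(self, obs_dim, n_choices_per_component):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.heads = nn.ModuleList(
            [nn.Linear(64, n) for n in n_choices_per_component]
        )

    def forward(self, obs):
        h = self.body(obs)
        dists = [Categorical(logits=head(h)) for head in self.heads]
        actions = [d.sample() for d in dists]
        # Components are treated as independent given the state,
        # so the joint log-prob is the sum over the K heads.
        log_prob = sum(d.log_prob(a) for d, a in zip(dists, actions))
        return torch.stack(actions, dim=-1), log_prob

policy = FactoredCategoricalPolicy(obs_dim=8, n_choices_per_component=[3, 3, 5])
action, log_prob = policy(torch.randn(1, 8))
```
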
1 vote · 0 answers
Why does PPO lead to a worse performance than TRPO in the same task?
So far I have been training an agent with an actor-critic network and updating it with TRPO. Now I tried out PPO, and the results are drastically different and much worse. I only changed from TRPO to PPO; the rest of the environment and the rewards are the same. PPO is…

thsolyt
0 votes · 0 answers
Very high dimensional optimization with large budget, requiring high quality solutions
What would theoretically be the best-performing optimization algorithm(s) in this case?
Very high dimensional problem: 250-500 parameters
Goal is to obtain very high quality solutions, not just "good" solutions
Parameters form multiple…

Charly Empereur-mot