Questions tagged [policy-gradients]

For questions related to the family of reinforcement learning algorithms often referred to as "policy gradients" (or "policy gradient algorithms"), which directly optimise a parameterised policy using gradients of an objective function with respect to the policy's parameters, without first estimating value functions.

For more info, see this tutorial: https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html
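For quick orientation (a standard textbook form, not specific to any question below), the objective and its gradient are typically written as:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right], \qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right],
$$

where $\tau$ denotes a trajectory sampled by following $\pi_\theta$ and $G_t$ is the (discounted) return from time step $t$ onwards.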

189 questions
44 votes · 2 answers

What is the relation between Q-learning and policy gradient methods?

As far as I understand, Q-learning and policy gradients (PG) are the two major approaches used to solve RL problems. While Q-learning aims to predict the reward of a certain action taken in a certain state, policy gradients directly predict the…
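A rough side-by-side of the two update rules the asker is contrasting (standard forms; the symbols $\alpha$, $\gamma$ and $G_t$ are assumptions, not taken from the question itself):

$$
\text{Q-learning:}\quad Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big[r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\big]
$$
$$
\text{Policy gradient:}\quad \theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
$$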
41 votes · 5 answers

How should I handle invalid actions (when using REINFORCE)?

I want to create an AI which can play five-in-a-row/Gomoku. I want to use reinforcement learning for this. I use the policy gradient method, namely REINFORCE, with a baseline. For the value and policy function approximation, I use a neural network. It…
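One commonly used workaround is to mask the logits of invalid actions before sampling, so their probability is exactly zero. A minimal sketch in PyTorch (the function name, board size and occupied cells are assumptions for illustration, not taken from the question):

    import torch

    def masked_action_distribution(logits, valid_mask):
        # Send the logits of illegal actions to -inf so their probability becomes exactly 0.
        masked_logits = logits.masked_fill(~valid_mask, float("-inf"))
        return torch.distributions.Categorical(logits=masked_logits)

    # Hypothetical usage: a 3x3 board flattened into 9 actions, two cells already occupied.
    logits = torch.randn(9)                      # stand-in for the policy network's output
    valid = torch.ones(9, dtype=torch.bool)
    valid[0] = valid[4] = False                  # occupied cells are illegal moves
    dist = masked_action_distribution(logits, valid)
    action = dist.sample()                       # only legal cells can be sampled
    log_prob = dist.log_prob(action)             # fed into the REINFORCE loss as usual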
17 votes · 1 answer

How can policy gradients be applied in the case of multiple continuous actions?

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are two cutting-edge policy gradient algorithms. When using a single continuous action, normally, you would use some probability distribution (for example, Gaussian)…
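The usual extension to several continuous actions is a factorized (diagonal) Gaussian, which is what many TRPO/PPO implementations use. A hedged sketch in PyTorch, with the layer sizes and class name purely illustrative:

    import torch
    import torch.nn as nn

    class DiagonalGaussianPolicy(nn.Module):
        """Minimal sketch of a policy over a multi-dimensional continuous action."""
        def __init__(self, obs_dim, act_dim):
            super().__init__()
            self.mean_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
            self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent std

        def forward(self, obs):
            mean = self.mean_net(obs)
            std = self.log_std.exp()
            # Independent turns the per-dimension Normals into one joint distribution,
            # so log_prob sums over action dimensions.
            return torch.distributions.Independent(torch.distributions.Normal(mean, std), 1)

    policy = DiagonalGaussianPolicy(obs_dim=8, act_dim=3)   # hypothetical sizes
    dist = policy(torch.randn(8))
    action = dist.sample()            # 3-dimensional action
    log_prob = dist.log_prob(action)  # scalar joint log-probability, used in the PG loss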
14 votes · 3 answers

Why does it make sense to normalize rewards per episode in reinforcement learning?

In OpenAI's actor-critic and in OpenAI's REINFORCE, the rewards are normalized like so: rewards = (rewards - rewards.mean()) / (rewards.std() + eps) on every episode individually. This is probably the baseline reduction, but I'm not entirely…
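A hedged sketch of the surrounding computation (only the normalisation line is from the excerpt; the discounting helper and variable names are assumptions):

    import torch

    eps = 1e-8  # small constant to avoid division by zero, as in the snippet above

    def discounted_returns(rewards, gamma=0.99):
        # Compute the return G_t for every step of one episode, working backwards.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        return torch.tensor(returns)

    returns = discounted_returns([0.0, 0.0, 1.0])
    # Standardising per episode: zero mean, unit scale. Subtracting the mean acts like a
    # baseline; dividing by the std additionally rescales the gradient, so this is not
    # exactly a plain baseline subtraction.
    returns = (returns - returns.mean()) / (returns.std() + eps)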
11 votes · 1 answer

Do off-policy policy gradient methods exist?

Do off-policy policy gradient methods exist? I know that policy gradient methods themselves use the policy function for sampling rollouts, but can't we just as well have a model for sampling from the environment? If so, I've never seen this done before.
asked by echo
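For context, off-policy policy gradient methods do exist (e.g. importance-sampling-corrected actor-critic). As a hedged sketch, with $\beta$ the behaviour policy that generated the rollouts, a commonly used per-decision form of the corrected estimator is:

$$
\nabla_\theta J(\theta) \approx \mathbb{E}_{(s_t,a_t) \sim \beta}\left[\frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right],
$$

noting that an exact correction would also re-weight the state distribution, which many practical methods approximate away.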
10 votes · 2 answers

How do I handle negative rewards in policy gradients with the cross-entropy loss function?

I am using policy gradients in my reinforcement learning algorithm, and occasionally my environment provides a severe penalty (i.e. negative reward) when a wrong move is made. I'm using a neural network with stochastic gradient descent to learn the…
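A minimal sketch of why negative rewards need no special treatment in this loss (PyTorch, with all values assumed for illustration): the log-probability of the chosen action is weighted by the return, so a negative return simply flips the sign of that term and pushes the action's probability down.

    import torch

    log_probs = torch.tensor([-0.7, -1.2, -0.3], requires_grad=True)  # log pi(a_t | s_t)
    returns   = torch.tensor([ 1.0, -5.0,  0.5])                      # includes a severe penalty
    loss = -(log_probs * returns).sum()   # weighted negative log-likelihood ("cross-entropy") loss
    loss.backward()                       # the penalised action's log-probability is pushed down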
9 votes · 2 answers

Why is a baseline conditioned on the state at some timestep unbiased?

In the homework for the Berkeley RL class, problem 1 asks you to show that the policy gradient is still unbiased if the baseline subtracted is a function of the state at time step $t$: $$ \nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t)…
asked by Laura C
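For readers skimming past, the standard argument (not quoted from the assignment) is that a baseline $b(s_t)$ depending only on the state can be pulled out of the expectation over actions, and what remains is the gradient of a constant:

$$
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
= b(s_t) \sum_{a} \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t)
= b(s_t)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t)
= b(s_t)\, \nabla_\theta 1 = 0.
$$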
9 votes · 3 answers

Is REINFORCE the same as 'vanilla policy gradient'?

I don't know what people mean by 'vanilla policy gradient', but what comes to mind is REINFORCE, which is the simplest policy gradient algorithm I can think of. Is this an accurate statement? By REINFORCE I mean this surrogate objective $$…
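The formula in the excerpt is cut off; for context, the surrogate objective most commonly associated with REINFORCE (which may differ from the exact expression the asker wrote) is:

$$
L(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t)\, G_t\right].
$$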
8 votes · 2 answers

Why does the "reward to go" trick in policy gradient methods work?

In the policy gradient method, there's a trick to reduce the variance of the policy gradient. We use causality, and remove part of the sum over rewards so that, for each action, only the rewards obtained after that action are taken into account (see here…
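A hedged restatement of the trick in question (one common form; discounting conventions vary):

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} \gamma^{\,t'-t} r_{t'}\right].
$$

Each $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ term is weighted only by rewards collected from step $t$ onwards, since an action cannot influence rewards that were already received; dropping the earlier terms leaves the expectation unchanged but lowers the variance.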
8 votes · 1 answer

How is the policy gradient calculated in REINFORCE?

Reading Sutton and Barto, I see the following in describing policy gradients: How is the gradient calculated with respect to an action (taken at time t)? I've read implementations of the algorithm, but conceptually I'm not sure I understand how the…
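A minimal sketch of how this is usually realised in code (PyTorch, all names and values assumed): no gradient flows through the sampled action itself; instead, the scalar $\log \pi_\theta(a_t \mid s_t)$ is differentiated with respect to the policy parameters, and the sampled action only selects which log-probability is used.

    import torch

    logits = torch.randn(4, requires_grad=True)           # stand-in for the policy network output
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                                 # a_t; no gradient flows through this
    G_t = 2.5                                              # return from time t (assumed value)
    loss = -dist.log_prob(action) * G_t                    # minimising this ascends the objective
    loss.backward()                                        # gradients w.r.t. the parameters (here, `logits`)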
8 votes · 2 answers

Why are lambda returns so rarely used in policy gradients?

I've seen the Monte Carlo return $G_{t}$ being used in REINFORCE and the TD($0$) target $r_t + \gamma Q(s', a')$ in vanilla actor-critic. However, I've never seen someone use the lambda return $G^{\lambda}_{t}$ in these situations, nor in any other…
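For reference, the lambda return mentioned here interpolates between those two targets (one common indexing convention, chosen to match the $r_t + \gamma Q(s', a')$ notation above):

$$
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}, \qquad
G_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V(s_{t+n}),
$$

so $\lambda = 0$ recovers the TD($0$) target and $\lambda \to 1$ recovers the Monte Carlo return $G_t$.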
7 votes · 1 answer

Why do Bellman equations indirectly create a policy?

I was watching a lecture on policy gradients and Bellman equations, in which they say that a Bellman equation indirectly creates a policy, while the policy gradient directly learns one. Why is this?
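A one-line way to see the distinction being drawn (standard formulation, not taken from the lecture itself):

$$
\pi(s) = \arg\max_{a} Q^{*}(s,a) \;\; \text{(policy recovered indirectly from a value function satisfying the Bellman optimality equation)}, \qquad
\pi_\theta(a \mid s) \;\; \text{(policy parameterised and learned directly).}
$$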
7 votes · 2 answers

Are policy gradient methods good for large discrete action spaces?

I have seen this question asked primarily in the context of continuous action spaces. I have a large action space (~2-4k discrete actions) for my custom environment that I cannot reduce further. I am currently trying DQN approaches but was…
asked by user9317212
7 votes · 1 answer

Which loss function should I use in REINFORCE, and what are the labels?

I understand that this is the update for the parameters of a policy in REINFORCE: $$ \Delta \theta_{t}=\alpha \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) v_{t}, $$ where $v_t$ is usually the discounted future reward and …
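A sketch of the loss most implementations minimise to realise that update (PyTorch, names and values assumed): the "label" is simply the action that was actually taken, and the resulting cross-entropy / negative log-likelihood term is weighted by $v_t$.

    import torch
    import torch.nn.functional as F

    logits = torch.randn(1, 6, requires_grad=True)       # policy network output for one state
    action = torch.tensor([2])                            # a_t, the action that was actually taken
    v_t = torch.tensor(1.7)                               # discounted future reward (assumed value)
    nll = F.cross_entropy(logits, action)                 # = -log pi(a_t | s_t)
    loss = v_t * nll                                      # a gradient step on this matches the update above
    loss.backward()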
7 votes · 1 answer

Why do the standard and deterministic Policy Gradient Theorems differ in their treatment of the derivatives of $R$ and the conditional probability?

I would like to understand the difference between the standard policy gradient theorem and the deterministic policy gradient theorem. These two theorems are quite different, although the only difference is whether the policy function is deterministic…
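For reference, the two theorems being compared are usually stated as follows (standard forms, with $\rho$ the discounted state distribution and $\mu_\theta$ the deterministic policy; notation assumed rather than taken from the question):

$$
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s,a)\big] \quad \text{(stochastic)},
$$
$$
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}\big] \quad \text{(deterministic)}.
$$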