Questions tagged [policy-gradients]

For questions related to the family of reinforcement learning algorithms often referred to as "policy gradients" (or "policy gradient algorithms"), which directly optimise a parameterised policy using gradients of an objective function with respect to the policy's parameters, without first estimating value functions.

For more info, see this tutorial: https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html
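For quick orientation (a standard textbook form, not specific to any question below), the objective and its gradient are typically written as:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right], \qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right],
$$

where $\tau$ denotes a trajectory sampled by following $\pi_\theta$ and $G_t$ is the (discounted) return from time step $t$ onwards.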

189 questions
44 votes · 2 answers

What is the relation between Q-learning and policy gradient methods?

As far as I understand, Q-learning and policy gradients (PG) are the two major approaches used to solve RL problems. While Q-learning aims to predict the reward of a certain action taken in a certain state, policy gradients directly predict the…
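A rough side-by-side of the two update rules the asker is contrasting (standard forms; the symbols $\alpha$, $\gamma$ and $G_t$ are assumptions, not taken from the question itself):

$$
\text{Q-learning:}\quad Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big[r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\big]
$$
$$
\text{Policy gradient:}\quad \theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
$$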
41 votes · 5 answers

How should I handle invalid actions (when using REINFORCE)?

I want to create an AI which can play five-in-a-row/Gomoku. I want to use reinforcement learning for this. I use the policy gradient method, namely REINFORCE, with a baseline. For the value and policy function approximation, I use a neural network. It…
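One commonly used workaround is to mask the logits of invalid actions before sampling, so their probability is exactly zero. A minimal sketch in PyTorch (the function name, board size and occupied cells are assumptions for illustration, not taken from the question):

    import torch

    def masked_action_distribution(logits, valid_mask):
        # Send the logits of illegal actions to -inf so their probability becomes exactly 0.
        masked_logits = logits.masked_fill(~valid_mask, float("-inf"))
        return torch.distributions.Categorical(logits=masked_logits)

    # Hypothetical usage: a 3x3 board flattened into 9 actions, two cells already occupied.
    logits = torch.randn(9)                      # stand-in for the policy network's output
    valid = torch.ones(9, dtype=torch.bool)
    valid[0] = valid[4] = False                  # occupied cells are illegal moves
    dist = masked_action_distribution(logits, valid)
    action = dist.sample()                       # only legal cells can be sampled
    log_prob = dist.log_prob(action)             # fed into the REINFORCE loss as usual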
17 votes · 1 answer

How can policy gradients be applied in the case of multiple continuous actions?

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are two cutting-edge policy gradient algorithms. When using a single continuous action, normally, you would use some probability distribution (for example, Gaussian)…
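The usual extension to several continuous actions is a factorized (diagonal) Gaussian, which is what many TRPO/PPO implementations use. A hedged sketch in PyTorch, with the layer sizes and class name purely illustrative:

    import torch
    import torch.nn as nn

    class DiagonalGaussianPolicy(nn.Module):
        """Minimal sketch of a policy over a multi-dimensional continuous action."""
        def __init__(self, obs_dim, act_dim):
            super().__init__()
            self.mean_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
            self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent std

        def forward(self, obs):
            mean = self.mean_net(obs)
            std = self.log_std.exp()
            # Independent turns the per-dimension Normals into one joint distribution,
            # so log_prob sums over action dimensions.
            return torch.distributions.Independent(torch.distributions.Normal(mean, std), 1)

    policy = DiagonalGaussianPolicy(obs_dim=8, act_dim=3)   # hypothetical sizes
    dist = policy(torch.randn(8))
    action = dist.sample()            # 3-dimensional action
    log_prob = dist.log_prob(action)  # scalar joint log-probability, used in the PG loss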
14 votes · 3 answers

Why does it make sense to normalize rewards per episode in reinforcement learning?

In OpenAI's actor-critic and in OpenAI's REINFORCE, the rewards are normalized like so: rewards = (rewards - rewards.mean()) / (rewards.std() + eps) on every episode individually. This is probably the baseline reduction, but I'm not entirely…
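A hedged sketch of the surrounding computation (only the normalisation line is from the excerpt; the discounting helper and variable names are assumptions):

    import torch

    eps = 1e-8  # small constant to avoid division by zero, as in the snippet above

    def discounted_returns(rewards, gamma=0.99):
        # Compute the return G_t for every step of one episode, working backwards.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        return torch.tensor(returns)

    returns = discounted_returns([0.0, 0.0, 1.0])
    # Standardising per episode: zero mean, unit scale. Subtracting the mean acts like a
    # baseline; dividing by the std additionally rescales the gradient, so this is not
    # exactly a plain baseline subtraction.
    returns = (returns - returns.mean()) / (returns.std() + eps)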
11 votes · 1 answer

Do off-policy policy gradient methods exist?

Do off-policy policy gradient methods exist? I know that policy gradient methods themselves use the policy function for sampling rollouts, but can't we just as well have a model for sampling from the environment? If so, I've never seen this done before.
asked by echo
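For context, off-policy policy gradient methods do exist (e.g. importance-sampling-corrected actor-critic). As a hedged sketch, with $\beta$ the behaviour policy that generated the rollouts, a commonly used per-decision form of the corrected estimator is:

$$
\nabla_\theta J(\theta) \approx \mathbb{E}_{(s_t,a_t) \sim \beta}\left[\frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right],
$$

noting that an exact correction would also re-weight the state distribution, which many practical methods approximate away.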
10 votes · 2 answers

How do I handle negative rewards in policy gradients with the cross-entropy loss function?

I am using policy gradients in my reinforcement learning algorithm, and occasionally my environment provides a severe penalty (i.e. negative reward) when a wrong move is made. I'm using a neural network with stochastic gradient descent to learn the…
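A minimal sketch of why negative rewards need no special treatment in this loss (PyTorch, with all values assumed for illustration): the log-probability of the chosen action is weighted by the return, so a negative return simply flips the sign of that term and pushes the action's probability down.

    import torch

    log_probs = torch.tensor([-0.7, -1.2, -0.3], requires_grad=True)  # log pi(a_t | s_t)
    returns   = torch.tensor([ 1.0, -5.0,  0.5])                      # includes a severe penalty
    loss = -(log_probs * returns).sum()   # weighted negative log-likelihood ("cross-entropy") loss
    loss.backward()                       # the penalised action's log-probability is pushed down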
9 votes · 2 answers

Why is a baseline conditioned on the state at some timestep unbiased?

In the homework for the Berkeley RL class, problem 1 asks you to show that the policy gradient is still unbiased if the baseline subtracted is a function of the state at time step $t$: $$ \nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t)…
asked by Laura C
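For readers skimming past, the standard argument (not quoted from the assignment) is that a baseline $b(s_t)$ depending only on the state can be pulled out of the expectation over actions, and what remains is the gradient of a constant:

$$
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
= b(s_t) \sum_{a} \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t)
= b(s_t)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t)
= b(s_t)\, \nabla_\theta 1 = 0.
$$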
9 votes · 3 answers

Is REINFORCE the same as 'vanilla policy gradient'?

I don't know what people mean by 'vanilla policy gradient', but what comes to mind is REINFORCE, which is the simplest policy gradient algorithm I can think of. Is this an accurate statement? By REINFORCE I mean this surrogate objective $$…
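The formula in the excerpt is cut off; for context, the surrogate objective most commonly associated with REINFORCE (which may differ from the exact expression the asker wrote) is:

$$
L(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t)\, G_t\right].
$$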
8 votes · 2 answers

Why does the "reward to go" trick in policy gradient methods work?

In the policy gradient method, there's a trick to reduce the variance of the policy gradient. We use causality, and remove part of the sum over rewards so that, for each action, only the rewards obtained after that action are taken into account (see here…
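A hedged restatement of the trick in question (one common form; discounting conventions vary):

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} \gamma^{\,t'-t} r_{t'}\right].
$$

Each $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ term is weighted only by rewards collected from step $t$ onwards, since an action cannot influence rewards that were already received; dropping the earlier terms leaves the expectation unchanged but lowers the variance.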
8 votes · 1 answer

How is the policy gradient calculated in REINFORCE?

Reading Sutton and Barto, I see the following in describing policy gradients: How is the gradient calculated with respect to an action (taken at time t)? I've read implementations of the algorithm, but conceptually I'm not sure I understand how the…
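A minimal sketch of how this is usually realised in code (PyTorch, all names and values assumed): no gradient flows through the sampled action itself; instead, the scalar $\log \pi_\theta(a_t \mid s_t)$ is differentiated with respect to the policy parameters, and the sampled action only selects which log-probability is used.

    import torch

    logits = torch.randn(4, requires_grad=True)           # stand-in for the policy network output
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                                 # a_t; no gradient flows through this
    G_t = 2.5                                              # return from time t (assumed value)
    loss = -dist.log_prob(action) * G_t                    # minimising this ascends the objective
    loss.backward()                                        # gradients w.r.t. the parameters (here, `logits`)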
8 votes · 2 answers

Why are lambda returns so rarely used in policy gradients?

I've seen the Monte Carlo return $G_{t}$ being used in REINFORCE and the TD($0$) target $r_t + \gamma Q(s', a')$ in vanilla actor-critic. However, I've never seen someone use the lambda return $G^{\lambda}_{t}$ in these situations, nor in any other…
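For reference, the lambda return mentioned here interpolates between those two targets (one common indexing convention, chosen to match the $r_t + \gamma Q(s', a')$ notation above):

$$
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}, \qquad
G_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V(s_{t+n}),
$$

so $\lambda = 0$ recovers the TD($0$) target and $\lambda \to 1$ recovers the Monte Carlo return $G_t$.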
7 votes · 1 answer

Why do Bellman equations indirectly create a policy?

I was watching a lecture on policy gradients and Bellman equations, in which they say that a Bellman equation indirectly creates a policy, while the policy gradient directly learns one. Why is this?
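A one-line way to see the distinction being drawn (standard formulation, not taken from the lecture itself):

$$
\pi(s) = \arg\max_{a} Q^{*}(s,a) \;\; \text{(policy recovered indirectly from a value function satisfying the Bellman optimality equation)}, \qquad
\pi_\theta(a \mid s) \;\; \text{(policy parameterised and learned directly).}
$$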
7 votes · 2 answers

Are policy gradient methods good for large discrete action spaces?

I have seen this question asked primarily in the context of continuous action spaces. I have a large action space (~2-4k discrete actions) for my custom environment that I cannot reduce further. I am currently trying DQN approaches but was…
asked by user9317212
7 votes · 1 answer

Which loss function should I use in REINFORCE, and what are the labels?

I understand that this is the update for the parameters of a policy in REINFORCE: $$ \Delta \theta_{t}=\alpha \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) v_{t}, $$ where $v_t$ is usually the discounted future reward and …
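A sketch of the loss most implementations minimise to realise that update (PyTorch, names and values assumed): the "label" is simply the action that was actually taken, and the resulting cross-entropy / negative log-likelihood term is weighted by $v_t$.

    import torch
    import torch.nn.functional as F

    logits = torch.randn(1, 6, requires_grad=True)       # policy network output for one state
    action = torch.tensor([2])                            # a_t, the action that was actually taken
    v_t = torch.tensor(1.7)                               # discounted future reward (assumed value)
    nll = F.cross_entropy(logits, action)                 # = -log pi(a_t | s_t)
    loss = v_t * nll                                      # a gradient step on this matches the update above
    loss.backward()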
7 votes · 1 answer

Why do the standard and deterministic Policy Gradient Theorems differ in their treatment of the derivatives of $R$ and the conditional probability?

I would like to understand the difference between the standard policy gradient theorem and the deterministic policy gradient theorem. These two theorems are quite different, although the only difference is whether the policy function is deterministic…
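For reference, the two theorems being compared are usually stated as follows (standard forms, with $\rho$ the discounted state distribution and $\mu_\theta$ the deterministic policy; notation assumed rather than taken from the question):

$$
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s,a)\big] \quad \text{(stochastic)},
$$
$$
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}\big] \quad \text{(deterministic)}.
$$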