I am following OpenAI's Spinning Up tutorial, Part 3: Intro to Policy Optimization. It is mentioned there that the reward-to-go trick reduces the variance of the policy gradient estimate. While I understand the intuition behind it, I struggle to find a proof in the literature.
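To make the question concrete, here is the reward-to-go form of the policy gradient estimator as I understand it from the tutorial (my notation may differ slightly from the original):

$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1}),$$

i.e. the log-probability of each action is weighted only by the rewards obtained after that action, rather than by the full trajectory return $R(\tau)$. The claim is that this estimator has lower variance than the one weighted by $R(\tau)$, and I am looking for a formal proof of that claim.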
- Does the answer to [this](https://ai.stackexchange.com/questions/9614/why-does-the-reward-to-go-trick-in-policy-gradient-methods-work?rq=1) question answer yours as well? – user5093249 Jun 10 '20 at 13:55
- No, the linked question only proves that the reward-to-go trick does not introduce any bias into the gradient estimate. – sirKris van Dela Jun 10 '20 at 14:14
- This is nontrivial to prove; actually, anything involving stochastic function approximation is nontrivial. You can search research papers, but you won't find it in any book right now. – FourierFlux Jun 10 '20 at 14:33