8

In the policy gradient method, there's a trick to reduce the variance of the policy gradient. We use causality and drop part of the sum over rewards, so that, for each action, only the rewards obtained from that time step onward are taken into account (see slide 18 of http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-5.pdf).

Why does it work? I understand the intuitive explanation, but what's the rigorous proof of it? Can you point me to some papers?

nbro

2 Answers

10

An important thing we're going to need is what is called the "Expected Grad-Log-Prob Lemma" in OpenAI's Spinning Up documentation (a proof is included on that page), which says that (for any $t$):

$$\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right] = 0.$$
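As a quick numerical sanity check of the lemma (my own sketch, not part of the linked proof), one can sample actions from a simple softmax policy and verify that the average score $\nabla_{\theta} \log \pi_{\theta}(a \mid s)$ is close to the zero vector. The toy policy and its parameterization below are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a softmax policy over 3 discrete actions for one
# fixed state, parameterized directly by per-action logits theta.
theta = rng.normal(size=3)


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def grad_log_pi(theta, a):
    # For a softmax policy: d/d theta_j log pi(a) = 1{j == a} - pi(j).
    pi = softmax(theta)
    one_hot = np.eye(len(theta))[a]
    return one_hot - pi


# Monte Carlo estimate of E_{a ~ pi_theta}[ grad_theta log pi_theta(a) ].
pi = softmax(theta)
samples = rng.choice(len(theta), size=100_000, p=pi)
estimate = np.mean([grad_log_pi(theta, a) for a in samples], axis=0)

print(estimate)  # each component should be close to 0
```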

Taking the analytical expression of the gradient (from, for example, slide 9) as a starting point:

$$\begin{aligned}
\nabla_{\theta} J(\theta)
&= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \left( \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \right) \left( \sum_{t=1}^T r(s_t, a_t) \right) \right] \\
&= \sum_{t=1}^T \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=1}^T r(s_{t'}, a_{t'}) \right] \\
&= \sum_{t=1}^T \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) + \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=t}^T r(s_{t'}, a_{t'}) \right] \\
&= \sum_{t=1}^T \left( \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) \right] + \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=t}^T r(s_{t'}, a_{t'}) \right] \right)
\end{aligned}$$

At the $t^{\text{th}}$ "iteration" of the outer sum, the random variables $ \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) $ and $ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) $ are independent (by definition of the policy, the action $a_t$ depends only on the most recent state $s_t$), which means we are allowed to split the expectation of their product into a product of expectations:

$$\nabla_{\theta} J(\theta) = \sum_{t=1}^T \left( \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) \right] \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \right] + \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=t}^T r(s_{t'}, a_{t'}) \right] \right)$$

The first expectation can now be replaced by $0$ due to the lemma mentioned at the top of the post:

$$\begin{aligned}
\nabla_{\theta} J(\theta)
&= \sum_{t=1}^T \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=t}^T r(s_{t'}, a_{t'}) \right] \\
&= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \left( \sum_{t'=t}^T r(s_{t'}, a_{t'}) \right) \right].
\end{aligned}$$

The expression on slide 18 of the linked slides is an unbiased, sample-based estimator of this gradient:

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta} (a_{i, t} \mid s_{i, t}) \left( \sum_{t'=t}^T r(s_{i, t'}, a_{i, t'}) \right)$$
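For concreteness, here is a minimal sketch of how one might compute this reward-to-go estimator from sampled trajectories. This is my own illustration rather than code from the slides; the array shapes and helper names are assumptions:

```python
import numpy as np


def reward_to_go(rewards):
    # rewards: array of shape (T,); entry t of the result is sum_{t'=t}^{T} r_{t'}.
    return np.cumsum(rewards[::-1])[::-1]


def policy_gradient_estimate(grad_log_probs, rewards):
    # grad_log_probs: list over trajectories, each of shape (T, num_params),
    #                 holding grad_theta log pi_theta(a_t | s_t) per step.
    # rewards:        list over trajectories, each of shape (T,).
    # Returns the reward-to-go estimator (1/N) sum_i sum_t g_{i,t} * R_{i,t}.
    total = 0.0
    for g, r in zip(grad_log_probs, rewards):
        total = total + (g * reward_to_go(r)[:, None]).sum(axis=0)
    return total / len(rewards)


# Hypothetical usage with random placeholder data (2 trajectories, T = 4 steps,
# 3 policy parameters); in practice these would come from rollouts of pi_theta.
rng = np.random.default_rng(0)
grads = [rng.normal(size=(4, 3)) for _ in range(2)]
rews = [rng.normal(size=4) for _ in range(2)]
print(policy_gradient_estimate(grads, rews))
```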


For a more formal treatment of the claim that we can pull $\sum_{t'=1}^{t-1} r(s_{t'}, a_{t'})$ out of an expectation due to the Markov property, see this page: https://spinningup.openai.com/en/latest/spinningup/extra_pg_proof1.html

Mike Land
Dennis Soemers
  • Why is the form of the "Expected Grad-Log-Prob Lemma" mentioned at the top of the post legitimate? It is different from the form shown in the linked proof (the proof in the link itself has no problem): your lemma is an expectation over **trajectories**. – starriet Jan 04 '22 at 03:09
  • @starriet For the individual choice of a single action at one point in time within a trajectory, the trajectory as a whole has no influence (our policy does not look at the history or the future, just at the current state), so $\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \right] = \mathbb{E}_{a_t \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \right]$, where the right-hand side more closely matches that notation from the link. – Dennis Soemers Jan 04 '22 at 09:43
  • @starriet However, we use the trajectory-based notation because in the larger equation, we have these sums over all time steps, and that's the part where the full trajectory actually becomes relevant. – Dennis Soemers Jan 04 '22 at 09:45
  • Thanks, I think I got your point. If I understood correctly, the right-hand side is at a *specific state*, and the left-hand side is at the various states and actions of *all possible trajectories*, right? Since the RHS is 0 and it's about one state, LHS is also 0 because LHS is an expectation over all states (this is how I understood. please let me know if I'm wrong). – starriet Jan 04 '22 at 12:41
  • But still, it's difficult to understand why the form of the LHS is legitimate. It's an expectation over trajectories, but the term inside the square brackets is a function of a single action and state. Given that a trajectory is composed of many states and actions, how can we take an expectation over trajectories when the term inside the brackets involves only part of a trajectory? – starriet Jan 04 '22 at 12:51
  • @starriet Suppose that we have a "trajectory" $\tau$ where we roll 100 dice in a row. Forget about RL and actions and policies for a moment, just rolling 100 dice in a row. These dice rolls are independent of each other, so the expected value of any individual roll $x_t$ is $\mathbb{E}_{x_{t} \sim \text{dice-roll}} [ x_t ] = 3.5$. Here we take the expectation over just that single dice-roll event. We can also take it over the entire trajectory of all rolls though, but it won't change anything since they're independent: $\mathbb{E}_{\tau \sim \text{100 dice-rolls}} [ x_t ] = 3.5$. – Dennis Soemers Jan 04 '22 at 12:59
  • @starriet So here in RL with the actions it's kind of a similar idea. The expectation taken over the entire trajectory takes lots of extra irrelevant information into account which doesn't change anything due to independence (but it's convenient because we do need that "bigger" expectation elsewhere in the equation) – Dennis Soemers Jan 04 '22 at 13:00
1

A correct proof is given at https://spinningup.openai.com/en/latest/spinningup/extra_pg_proof1.html. It uses, among other techniques, the probability chain rule.

In contrast, an attempt to prove the "reward to go" form by applying $X, Y \text{ independent} \implies \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$ does not work.

The random variables $ \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) $ and $ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) $ are generally not independent. They would be independent if $ P\left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \,\middle|\, \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) \right] = P\left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \right] $, but this does not follow from $ \pi_{\theta}(a_t \mid s_t) = \pi_{\theta}(a_t \mid s_t, s_{t-1}, \ldots, s_1, a_{t-1}, \ldots, a_1) $.

Consider the following setup as an example of the dependence between the two random variables in question.

times = {1, 2, 3, ..., T}

states = {1, 2, 3, ..., T}

actions = {0, 1}

$ \pi(a_t = 0 \mid s_t) = \frac{1}{s_t} $

$ r(s_t, a_t) = 1 $, i.e., the reward at time $t$ equals $1$, so the total reward from time $1$ to time $t-1$ equals $t-1$

$ P(s_1 = 1) = 1 $, i.e., all trajectories start in state $1$

$ P(s_{t+1} = s_t + 1 \mid s_t, a_t) = 1 $, i.e., the next state, given the current state and the current action, is always the current state $+ 1$

As a result, the higher the time $ t $, the higher the state $ s_t $, the higher the probability of action $ a_t = 1 $, and the higher the total reward from time $ 1 $ to time $ t-1 $. The values of $ \pi(a_t \mid s_t) $, and thus the values of $ \nabla_{\theta} \log \pi (a_t \mid s_t) $, are therefore highly correlated with the total reward from time $ 1 $ to time $ t-1 $, hence the respective random variables are not independent of each other.
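Below is a small simulation sketch of this toy setup (my own illustration, not part of the original answer). Since the toy policy has no explicit parameters, $\log \pi(a_t \mid s_t)$ is used as a stand-in for its gradient; the script estimates the correlation, pooled over all time steps and trajectories, between the cumulative reward collected before time $t$ and $\log \pi(a_t \mid s_t)$:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20          # horizon
N = 5_000       # number of sampled trajectories

past_reward_sums = []   # sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) at each step
log_pi_values = []      # log pi(a_t | s_t) for the sampled action

for _ in range(N):
    s = 1                        # all trajectories start in state 1
    total_past_reward = 0.0
    for t in range(1, T + 1):
        p0 = 1.0 / s             # pi(a_t = 0 | s_t) = 1 / s_t
        a = rng.random() >= p0   # a_t = 1 with probability 1 - 1/s_t
        log_pi = np.log(1.0 - p0) if a else np.log(p0)

        past_reward_sums.append(total_past_reward)
        log_pi_values.append(log_pi)

        total_past_reward += 1.0  # r(s_t, a_t) = 1
        s = s + 1                 # deterministic transition s_{t+1} = s_t + 1

# Nonzero correlation, pooled over time steps, illustrating the claimed dependence.
print(np.corrcoef(past_reward_sums, log_pi_values)[0, 1])
```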

blubb3456