9

Problem 1 of the homework for the Berkeley RL class asks you to show that the policy gradient is still unbiased if the subtracted baseline is a function of the state at time step $t$, i.e. that

$$ \nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t)} [b(s_t)] = 0 $$

I am struggling through what the first step of such a proof might be.

Can someone point me in the right direction? My initial thought was to somehow use the law of total expectation to make the expectation of $b(s_t)$ conditional on $T$, but I am not sure.

nbro
Laura C
    I don't really have much time to write down the exact equations and format them with LaTeX (maybe later if it is still unanswered), but here is a hint: you want the sum not to depend on the policy, so that its derivative is 0. So you somehow try to express things using the policy p(s,a). The answer, by the way, can also be found in the policy gradient chapter of Sutton's RL Intro book. – Hai Nguyen Sep 10 '18 at 09:51
  • 1
    Thank you very much! I will use that hint to get started, as well as thank you for telling me about it being in Sutton RL. I am reading that book and it is quite excellent! – Laura C Sep 10 '18 at 10:51
  • @LauraC could you please elaborate how you arrived at this formulation? Because in the homework (at least as of now), the original equation is different and I don't know how to get to your version. Thanks! – Max Semikin Oct 07 '18 at 12:14

2 Answers

7

Using the law of iterated expectations one has:

$\nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t)} [b(s_t)] = \nabla_\theta \sum_{t=1}^T \mathbb{E}_{s_t \sim p(s_t)} \left[ \mathbb{E}_{a_t \sim \pi_\theta(a_t | s_t)} \left[ b(s_t) \right]\right] =$

Writing this with integrals and moving the gradient inside (by linearity), you get

$= \sum_{t=1}^T \int_{s_t} p(s_t) \left(\int_{a_t} \nabla_\theta b(s_t) \pi_\theta(a_t | s_t) da_t \right)ds_t =$

You can now move $\nabla_\theta$ (due to linearity) and $b(s_t)$ (which does not depend on $a_t$) from the inner integral to the outer one:

$= \sum_{t=1}^T \int_{s_t} p(s_t) b(s_t) \nabla_\theta \left(\int_{a_t} \pi_\theta(a_t | s_t) da_t \right)ds_t= $

$\pi_\theta(a_t | s_t)$ is a (conditional) probability density function, so integrating over all $a_t$ for a given fixed state $s_t$ equals $1$:

$= \sum_{t=1}^T \int_{s_t} p(s_t) b(s_t) \nabla_\theta 1 ds_t = $

Now $\nabla_\theta 1 = 0$, which concludes the proof.
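As a quick numerical sanity check of this cancellation (my own illustration, not part of the original answer), the Monte Carlo average of the score-function form of the baseline term, $\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)$, over actions sampled from a softmax policy should approach the zero vector. Every concrete value below (three actions, the logits, the baseline value, the sample size) is made up purely for illustration.

```python
# Sketch: check that a state-dependent baseline contributes no bias, i.e.
# E_{a ~ pi_theta}[ grad_theta log pi_theta(a|s) * b(s) ] ~= 0.
# All concrete values (3 actions, theta, b_s, sample size) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)   # logits of a softmax policy for a single state
b_s = 5.0                    # arbitrary baseline value for that state

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, a):
    # For a softmax policy: d/d theta_k log pi(a) = 1{k == a} - pi(k)
    g = -softmax(theta)
    g[a] += 1.0
    return g

pi = softmax(theta)
actions = rng.choice(len(theta), size=200_000, p=pi)
estimate = np.mean([grad_log_pi(theta, a) * b_s for a in actions], axis=0)
print(estimate)  # each component should be close to 0, up to Monte Carlo noise
```

The cancellation the code exhibits is exactly the $\nabla_\theta \int_{a_t} \pi_\theta(a_t | s_t) da_t = \nabla_\theta 1 = 0$ step above.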

nbro
1

It appears that the homework was due two days before this answer was written, but in case it is still useful, the relevant class notes (which would have been helpful if provided in the question along with the homework) are here.

The first instance of expectation placed on the student is, "Please show equation 12 by using the law of iterated expectations, breaking $\mathbb{E}_{\tau \sim p_\theta(\tau)}$ by decoupling the state-action marginal from the rest of the trajectory." Equation 12 is this.

$$ \sum_{t = 1}^{T} \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t)\, b(s_t) \right] = 0 $$

The class notes identify $p_\theta(s_t, a_t)$ as the state-action marginal. What is sought is not a full proof, but a sequence of algebraic steps that perform the decoupling and show the degree to which independence of the state-action marginal can be achieved.
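For concreteness (this is my own sketch of that decoupling, not quoted from the notes or the homework, and it is essentially the same argument as in the other answer), the law of iterated expectations lets one write, for each $t$,

$$ \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t)\, b(s_t) \right] = \mathbb{E}_{s_t \sim p_\theta(s_t)} \left[ b(s_t)\, \mathbb{E}_{a_t \sim \pi_\theta(a_t | s_t)} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \right] \right], $$

and the inner expectation vanishes because

$$ \mathbb{E}_{a_t \sim \pi_\theta(a_t | s_t)} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \right] = \int_{a_t} \nabla_\theta \pi_\theta(a_t | s_t)\, da_t = \nabla_\theta \int_{a_t} \pi_\theta(a_t | s_t)\, da_t = \nabla_\theta 1 = 0. $$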

This exercise is a preparation for the next step in the homework and draws only on the review of CS189, Berkeley's Introduction to Machine Learning course, which does not cover the law of total expectation in its syllabus or class notes.

All the relevant information is in the class notes linked above, and the exercise requires only intermediate algebra.

Douglas Daseeco