
In Sutton and Barto's book (http://incompleteideas.net/book/bookdraft2017nov5.pdf), a proof of the policy gradient theorem is provided on pg. 269 for the episodic case and the start-state objective function (see the picture below, last three equations).

[Image: the last three equations of Sutton and Barto's proof of the policy gradient theorem (episodic case).]

Why can we assume that the sum $\sum_s\eta(s)$ is a constant of proportionality? Doesn't it also depend on $\theta$, since it depends on the policy $\pi$?

What would make sense to me is to say that $\nabla J(\theta) = \mathbb{E}_{s\sim \eta(s), a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]\propto \mathbb{E}_{s\sim d(s), a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]$.

Since the proportionality constant $\sum_s\eta(s)$ is strictly positive (it is the average time spent in an episode), any update direction suggested by $\mathbb{E}_{s\sim d(s), a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]$ is the same as that suggested by $\mathbb{E}_{s\sim \eta(s), a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]$, only with a different magnitude. This, however, shouldn't impact the learning process much, since we multiply the update term by a small learning rate anyway.

Hence, since it is easier to sample states from $d(s)$, we just set $\nabla_{\theta} J = \mathbb{E}_{s\sim d(s), a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]$ and absorb the constant into the learning rate.
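
To make this concrete for myself, here is a minimal numerical sketch (toy numbers of my own, not from the book): weighting the per-state terms by $\eta(s)$ or by $d(s) = \eta(s)/\sum_{s'}\eta(s')$ gives gradient vectors that point in the same direction and differ only by the positive factor $\sum_s\eta(s)$, which can be folded into the learning rate.

```python
# Toy numbers (made up, not from the book): eta(s) is the expected number of
# visits to each state per episode, d(s) the normalized on-policy distribution,
# and g[s] stands for the per-state term  sum_a grad_theta pi(a|s) q_pi(s, a)
# (two parameters here, arbitrary values just for illustration).
import numpy as np

eta = np.array([2.0, 0.5, 1.5])        # expected visits per episode
d = eta / eta.sum()                    # d(s) = eta(s) / sum_s' eta(s')
g = np.array([[ 0.3, -0.1],
              [ 0.2,  0.4],
              [-0.5,  0.1]])

grad_eta = (eta[:, None] * g).sum(axis=0)   # sum_s eta(s) g(s)
grad_d   = (d[:, None]   * g).sum(axis=0)   # sum_s d(s)   g(s)

# Same direction, scaled by the (strictly positive) average episode length:
print(np.allclose(grad_eta, eta.sum() * grad_d))          # True

# Absorbing that constant into the learning rate gives the exact same update:
alpha = 0.01
theta = np.zeros(2)
print(np.allclose(theta + alpha * grad_eta,
                  theta + (alpha * eta.sum()) * grad_d))  # True
```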

Could that serve as a plausible explanation?

pythonic833
jwl17
    a relevant question that may help: https://ai.stackexchange.com/questions/36650/in-the-policy-gradient-theorem-proof-why-is-d-pis-sum-k-0-infty-gam/36721#36721 –  Jan 12 '23 at 13:35
  • Thanks for linking this post! I think what you say in your answer there in the second paragraph ("Moreover, ...") goes in the same direction as what I tried to explain in my own answer suggestion above. So, it doesn't matter if we take samples from $d$ or $\eta$ (or $\rho$ in the linked post), as they represent the same distribution, just scaled by a factor, right? I guess that taking samples from $d$ is just more intuitive, as sampling from a probability distribution is more intuitive than sampling from something that is not, i.e., $\eta$. – jwl17 Jan 12 '23 at 18:05
  • The distributions $d$ and $\rho$ differ by more than just a scale factor. However, people just do not care about the difference or whether the result is rigorously correct. When we sample, rigorously speaking, we should run according to a policy for a long time until we reach the stationary phase. But in practice this is not done at all, due to limited data! Finally, this part is indeed confusing and there are many different objective functions. One simple case is that the distribution is independent of the policy. Then, calculating the gradient would be easier. More details can be found in that book. –  Jan 13 '23 at 01:37
  • The derivation answering your question is just above the expression on your screenshot. – Kostya Apr 01 '23 at 21:12

2 Answers


Spitballing some ideas here:

As $\pi_\theta$ is updated using some gradient optimization process with small steps in the direction of the gradient, the changes to state visitation are generally small as well. If so, maybe we can assume that the state-visitation frequency $\eta(s)$ will not change significantly. To be precise: while $\eta(s)$ is in fact $\eta(s,\theta)$, the assumption would be that when the step size is small enough, $\eta(s,\theta_t)\approx\eta(s,\theta_{t+1})$ for all $t$.

This may bring to mind a quasistatic process, which is a thermodynamic process that happens slowly enough for the system to remain in equilibrium. But hey, that's just my intuition.
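
As a rough numerical illustration of this intuition (a toy episodic MDP I made up, not from the book or the question), one can check that a small step in the softmax parameters $\theta$ changes the visitation frequencies $\eta(s,\theta)$ only slightly:

```python
# A toy episodic MDP (my own construction): two non-terminal states, two actions,
# softmax policy with per-state logits theta; any transition probability not
# listed below goes to the terminal state.
import numpy as np

np.random.seed(0)
n_states, n_actions = 2, 2
theta = np.random.randn(n_states, n_actions)      # softmax logits

p = np.zeros((n_states, n_actions, n_states))     # p[s, a, s'] over non-terminal s'
p[0, 0, 1] = 1.0                                  # s0, a0 -> s1
p[1, 0, 0] = 0.5                                  # s1, a0 -> s0 w.p. 0.5, else terminal

def eta(theta):
    """eta(s) = sum_k Pr(s0 -> s, k, pi), the expected number of visits to s."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi = e / e.sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,sax->sx', pi, p)         # state-to-state kernel under pi
    start = np.array([1.0, 0.0])
    return np.linalg.solve(np.eye(n_states) - P_pi.T, start)

step = 1e-3 * np.random.randn(n_states, n_actions)   # a "small" gradient-sized update
print(eta(theta))
print(eta(theta + step))
print(np.abs(eta(theta + step) - eta(theta)).max())  # change is on the order of the step
```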

Hadar Sharvit
  • Thanks for your answer! Hmm, maybe that argument applies better to $\sum_s\eta(s)$. I read that $\eta(s)$ can be interpreted as the average time you spend in a state in one episode, which makes sense to me. Therefore, $\sum_s\eta(s)$ can be interpreted as the average time you spend in the episode. Maybe that doesn't change much in the sum, but the individual $\eta(s)$ change more strongly. I am just wondering whether there is a strict mathematical argument for this. The book gives me the impression there is, but I didn't find it in there, nor by looking at other posts. – jwl17 Jan 12 '23 at 09:25

The answer is: $\sum_{s} \eta(s)$ is not a constant with regard to $\theta$. As you already mentioned,
\begin{equation}
\sum_{k=0}^{\infty} \text{Pr}(s_{0}\rightarrow s, k, \pi) = \eta(s).
\end{equation}
But then the question arises: why is the derivation you showed correct? The point is not that $\sum_{s} \eta(s)$ is constant with regard to $\theta$ and is therefore not affected by the $\nabla$ operator, but quite the contrary: it is the *result* of applying the $\nabla$ operator to the state-value function. If you look at the proof of the policy gradient theorem (episodic case) in Sutton and Barto, at the simplification of $\nabla v_{\pi}(s)$, you will find that (not showing all steps here)
\begin{align}
\nabla v_{\pi}(s) &= \nabla \left[ \sum_{a} \pi(a|s)\,q_{\pi}(s, a) \right] \quad \text{for all } s \in S \\
&= \dots \\
&= \sum_{a}\left[\nabla \pi(a|s)\, q_{\pi}(s, a) + \pi(a|s) \sum_{s'}p(s'|s,a)\,\nabla v_{\pi}(s')\right] \quad (\text{using recursion}) \\
&= \sum_{a}\Big[\nabla \pi(a|s)\, q_{\pi}(s, a) + \pi(a|s) \sum_{s'}p(s'|s,a) \sum_{a'}\big[\nabla \pi(a'|s')\,q_{\pi}(s', a') \\
&\qquad\qquad + \pi(a'|s')\sum_{s''}p(s''|s',a')\,\nabla v_{\pi}(s'')\big]\Big] \quad (\text{further unrolling}) \\
&= \sum_{x \in S}\sum_{k=0}^{\infty} \text{Pr}(s\rightarrow x, k, \pi)\sum_{a}\nabla \pi(a|x)\,q_{\pi}(x,a),
\end{align}
where $\text{Pr}(s\rightarrow x, k, \pi)$ is the probability of moving from state $s$ to state $x$ in $k$ steps, accounting for all intermediate states $s_{1}, s_{2}, \dots$, by following policy $\pi$. So we see that the equation from which you start is the result, NOT the target, of a derivative.
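
As a sanity check of this last equality (a toy example of my own, not from the book), one can verify numerically on a tiny episodic MDP with $\gamma = 1$ that the finite-difference gradient of $J(\theta) = v_{\pi}(s_0)$ matches $\sum_{x}\eta(x)\sum_{a}\nabla\pi(a|x)\,q_{\pi}(x,a)$, where $\eta(x) = \sum_k \text{Pr}(s_0\rightarrow x, k, \pi)$:

```python
# Toy episodic MDP (my own construction): two non-terminal states, two actions,
# softmax policy with per-state logits; unlisted transition mass goes to the
# terminal state. gamma = 1, as in the book's episodic proof.
import numpy as np

np.random.seed(0)
n_states, n_actions = 2, 2
theta = np.random.randn(n_states, n_actions)         # softmax logits per state

p = np.zeros((n_states, n_actions, n_states))        # p[s, a, s'] over non-terminal s'
p[0, 0, 1] = 1.0
p[1, 0, 0] = 0.5
r = np.array([[1.0, 0.0],                             # deterministic rewards r(s, a)
              [2.0, 0.5]])
start = np.array([1.0, 0.0])                          # episodes start in state 0

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)           # pi(a|s)

def value(theta):
    """v_pi for gamma = 1 (terminal value is 0)."""
    pi = policy(theta)
    P_pi = np.einsum('sa,sax->sx', pi, p)
    r_pi = (pi * r).sum(axis=1)
    return np.linalg.solve(np.eye(n_states) - P_pi, r_pi)

def J(theta):
    return start @ value(theta)                       # J(theta) = v_pi(s0)

# Right-hand side: sum_x eta(x) sum_a grad pi(a|x) q_pi(x, a)
pi = policy(theta)
P_pi = np.einsum('sa,sax->sx', pi, p)
v = value(theta)
q = r + p @ v                                         # q_pi(s, a)
eta = np.linalg.solve(np.eye(n_states) - P_pi.T, start)   # sum_k Pr(s0 -> s, k, pi)

grad_pg = np.zeros_like(theta)
for s in range(n_states):
    for b in range(n_actions):
        # softmax derivative: d pi(a|s) / d theta[s, b] = pi(a|s) (1[a == b] - pi(b|s))
        dpi = pi[s] * ((np.arange(n_actions) == b) - pi[s, b])
        grad_pg[s, b] = eta[s] * (dpi @ q[s])

# Left-hand side: finite-difference gradient of J(theta) = v_pi(s0)
grad_fd = np.zeros_like(theta)
eps = 1e-6
for s in range(n_states):
    for b in range(n_actions):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[s, b] += eps
        t_minus[s, b] -= eps
        grad_fd[s, b] = (J(t_plus) - J(t_minus)) / (2 * eps)

print(np.allclose(grad_pg, grad_fd, atol=1e-5))       # should print True
```

On this small example the two gradients should agree up to finite-difference error, which is exactly the statement $\nabla v_{\pi}(s_0) = \sum_{x}\eta(x)\sum_{a}\nabla\pi(a|x)\,q_{\pi}(x,a)$ from the last line of the derivation above.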

pythonic833