In Sutton and Barto's book (http://incompleteideas.net/book/bookdraft2017nov5.pdf), a proof of the policy gradient theorem is given on p. 269 for the episodic case with the start-state objective $J(\theta) = v_{\pi_\theta}(s_0)$ (see the picture below, last three equations).
Why can we assume that the sum $\sum_s\eta(s)$ is a constant of proportionality? Doesn't it also depend on $\theta$, since it depends on the policy $\pi$?
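For reference, as far as I understand the book, $\eta(s)$ is the expected number of time steps spent in state $s$ during an episode, and $d(s)$ is its normalization:

$$\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(\bar{s}, a, \theta)\, p(s \mid \bar{s}, a), \qquad d(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')},$$

where $h(s)$ is the probability that an episode starts in $s$ and $p(s \mid \bar{s}, a)$ are the transition dynamics. Both $\eta$ and $\sum_s \eta(s)$ therefore seem to depend on $\theta$ through $\pi$, which is exactly what confuses me.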
What would make sense to me is to say that $\nabla_{\theta} J(\theta) = \mathbb{E}_{s\sim \eta(s),\, a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right] \propto \mathbb{E}_{s\sim d(s),\, a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]$.
Since the proportionality constant $\sum_s\eta(s)$ is always positive (it is the average length of an episode), any update direction suggested by $\mathbb{E}_{s\sim d(s),\, a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]$ is the same as that of $\mathbb{E}_{s\sim \eta(s),\, a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]$, just with a different magnitude. That shouldn't affect the learning process much, since the update term is multiplied by a small learning rate anyway, so the constant is effectively absorbed into the step size.
Hence, as it is easier to sample states from $d(s)$, we just set $\nabla_{\theta} J = \mathbb{E}_{s\sim d(s),\, a \sim \pi}\left[\nabla_{\theta}\log\left(\pi(s,a,\theta)\right)\,q_{\pi}(s,a)\right]$.
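To sanity-check this reasoning, I put together a small numerical example: a toy two-state MDP with a softmax policy, entirely of my own making (not from the book; all names and numbers below are just my assumptions). It computes $v_{\pi}$, $q_{\pi}$ and $\eta$ exactly, compares the $\eta$-weighted expression against a finite-difference gradient of $J(\theta) = v_{\pi}(s_0)$, and checks that the $d$-weighted expression differs from it only by the factor $\sum_s\eta(s)$:

```python
import numpy as np

np.random.seed(0)

# Toy episodic MDP (my own construction), undiscounted (gamma = 1).
# Nonterminal states 0 and 1, two actions each.
# transitions[(s, a)] = (reward, next_state); next_state = -1 means "terminal".
transitions = {
    (0, 0): (0.0, 1),    # from s0, action 0: reward 0, go to s1
    (0, 1): (1.0, -1),   # from s0, action 1: reward 1, terminate
    (1, 0): (2.0, -1),   # from s1, action 0: reward 2, terminate
    (1, 1): (0.0, 0),    # from s1, action 1: reward 0, back to s0
}
n_states, n_actions = 2, 2
theta = np.random.randn(n_states, n_actions)     # softmax policy parameters

def policy(theta):
    """pi[s, a] = softmax over actions of theta[s, :]."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def v_q_P(theta):
    """Exact v_pi, q_pi and the state-to-state transition matrix under pi."""
    pi = policy(theta)
    r_pi = np.zeros(n_states)
    P_pi = np.zeros((n_states, n_states))
    for (s, a), (r, s2) in transitions.items():
        r_pi[s] += pi[s, a] * r
        if s2 >= 0:
            P_pi[s, s2] += pi[s, a]
    v = np.linalg.solve(np.eye(n_states) - P_pi, r_pi)   # v = r_pi + P_pi v
    q = np.zeros((n_states, n_actions))
    for (s, a), (r, s2) in transitions.items():
        q[s, a] = r + (v[s2] if s2 >= 0 else 0.0)
    return v, q, P_pi

def visit_counts(theta):
    """eta(s): expected number of time steps spent in s per episode.
    Solves eta = h + P_pi^T eta (the recursion behind the on-policy distribution)."""
    _, _, P_pi = v_q_P(theta)
    h = np.array([1.0, 0.0])                     # episodes always start in s0
    return np.linalg.solve(np.eye(n_states) - P_pi.T, h)

def grad_log_pi(pi, s, a):
    """Gradient of log pi(a|s) w.r.t. all softmax parameters theta."""
    g = np.zeros_like(pi)
    g[s] = -pi[s]
    g[s, a] += 1.0
    return g

pi = policy(theta)
_, q, _ = v_q_P(theta)
eta = visit_counts(theta)
d = eta / eta.sum()                              # normalized on-policy distribution

# sum_s eta(s) sum_a pi(a|s) grad log pi(a|s) q(s,a) vs. the same thing weighted by d(s)
g_eta = sum(eta[s] * pi[s, a] * q[s, a] * grad_log_pi(pi, s, a)
            for s in range(n_states) for a in range(n_actions))
g_d = sum(d[s] * pi[s, a] * q[s, a] * grad_log_pi(pi, s, a)
          for s in range(n_states) for a in range(n_actions))

# Ground truth: finite-difference gradient of J(theta) = v_pi(s0).
eps = 1e-6
g_fd = np.zeros_like(theta)
for i in range(n_states):
    for j in range(n_actions):
        t = theta.copy()
        t[i, j] += eps
        g_fd[i, j] = (v_q_P(t)[0][0] - v_q_P(theta)[0][0]) / eps

print("sum_s eta(s) (avg episode length):", eta.sum())
print("eta-weighted expression equals grad J:   ", np.allclose(g_eta, g_fd, atol=1e-4))
print("eta-weighted equals sum(eta) * d-weighted:", np.allclose(g_eta, eta.sum() * g_d))
```

If the reasoning above is right, both checks should print True, with $\sum_s\eta(s)$ playing the role of the ($\theta$-dependent) proportionality "constant".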
Could that serve as a plausible explanation?