
In the TRPO paper, the objective to maximize is (equation 14) $$ \mathbb{E}_{s\sim\rho_{\theta_\text{old}},a\sim q}\left[\frac{\pi_\theta(a|s)}{q(a|s)} Q_{\theta_\text{old}}(s,a) \right] $$

which involves an expectation over states sampled with some density $\rho$, itself defined as $$ \rho_\pi(s) = P(s_0 = s)+\gamma P(s_1=s) + \gamma^2 P(s_2=s) + \dots $$

This seems to suggest that later timesteps should be sampled less often than earlier ones, or equivalently, that states should be sampled uniformly along trajectories with an importance-sampling weight of $\gamma^t$.
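To spell that reading out (my own rewriting, treating $\rho$ as the unnormalized visitation measure used in the paper, and truncating at some horizon $T$): for any function $f$ of the state,

$$\mathbb{E}_{s\sim\rho_{\theta_\text{old}}}\left[f(s)\right] \;=\; \sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{E}\left[f(s_t)\right] \;\approx\; T\,\mathbb{E}_{t\sim\mathrm{Unif}\{0,\dots,T-1\}}\left[\gamma^{t} f(s_t)\right],$$

so sampling timesteps uniformly and multiplying by $\gamma^t$ would match the stated objective up to a constant factor.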

However, the usual implementations simply use batches made of truncated or concatenated trajectories, without any reference to the location of the timesteps in the trajectory.

This is similar to what can be seen in the PPO paper, which transforms the above objective into (equation 3) $$ \mathbb{E}_t \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)} \hat A_t \right] $$
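To make the difference concrete, here is a minimal sketch of my own (not code from either paper; the tensor names are made up) of the two weightings of the surrogate over a flat batch of transitions:

```python
import torch

# log_prob_new, log_prob_old, advantages, timesteps: 1-D tensors with one
# entry per transition in a flat batch of (possibly concatenated) rollouts.

def surrogate_uniform(log_prob_new, log_prob_old, advantages):
    """What typical implementations compute: a plain mean over the batch,
    i.e. the empirical version of E_t, with every timestep weighted equally."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    return (ratio * advantages).mean()

def surrogate_discounted(log_prob_new, log_prob_old, advantages, timesteps, gamma=0.99):
    """What a literal reading of the rho-weighted objective would suggest:
    reweight each transition by gamma**t (normalized here for comparability)."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    weights = gamma ** timesteps.float()
    return (weights * ratio * advantages).sum() / weights.sum()
```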

It seems that something is missing in going from $\mathbb{E}_{s\sim \rho}$ to $\mathbb{E}_t$ in the discounted setting. Are they really equivalent?


1 Answer


As you point out, they are not equivalent. I guess you could store the time index for each state visited, but there are two problems with this.

First, if you sample states according to their time index, sampling from the replay memory becomes more cumbersome and probably much slower (you would have to sample a time index first, and then a specific state with that time index), which is clearly undesirable. If instead you add the importance-sampling term, you make most of the terms very small unless $\gamma$ is close to 1.
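For instance (my own numbers), the weights become negligible after a few hundred steps even for a fairly large discount factor:

$$\gamma = 0.99:\qquad \gamma^{100}\approx 0.37,\qquad \gamma^{500}\approx 6.6\times10^{-3},\qquad \gamma^{1000}\approx 4.3\times10^{-5},$$

so almost all of the collected samples would contribute essentially nothing to the gradient.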

Second, while the original objective is very convenient for obtaining theoretical results, you may ask yourself whether it is really what you want to maximize. Do you really care more about the performance at the beginning of the trajectory than at the end?

While I have no rigorous proof, I like to think that a better definition of the expected discounted reward is a time average:

$$\eta(\pi) = \lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N\mathbb E_{s_k,a_k,s_{k+1},\cdots}\left[\sum_{t=k}^\infty\gamma^{t-k}r(s_t)\right].$$

Since each term of the average satisfies the original proofs, the only difference from equation 14 is that the density becomes $\tilde\rho_{\theta_\text{old}}(s) = \lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N\rho_{k,\theta_\text{old}}(s)$, where $$\rho_{k,\theta_\text{old}}(s) = P(s_k = s)+\gamma P(s_{k+1}=s) + \gamma^2 P(s_{k+2}=s) + \dots$$ Notice that, for large $N$, this gives essentially the same importance to every timestep. So the usual implementations actually optimize some complicated objective similar to the one defined here, in which all samples receive a similar weight irrespective of their timestep.
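Here is a quick numerical sketch of that claim (my own, with made-up values of $\gamma$ and $N$): the weight that $\tilde\rho_{\theta_\text{old}}$ assigns to timestep $t$ of a single long rollout flattens out to a constant, whereas the original $\rho_{\theta_\text{old}}$ decays geometrically.

```python
import numpy as np

gamma = 0.99
N = 10_000                       # number of start indices k averaged over
t = np.arange(1, N + 1)          # timesteps within the rollout

# Under rho_k, timestep t has weight gamma**(t - k) for t >= k, so averaging
# over k = 1..t gives (1 - gamma**t) / (N * (1 - gamma)) as the weight of t.
w_avg = (1 - gamma ** t) / (N * (1 - gamma))

# Under the original rho (i.e. rho_1), timestep t has weight gamma**(t - 1).
w_orig = gamma ** (t - 1.0)

print(w_avg[[0, 99, 999, -1]])   # quickly saturates at 1 / (N * (1 - gamma))
print(w_orig[[0, 99, 999, -1]])  # decays geometrically towards 0
```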

Diego Gomez