In the TRPO paper, the objective to maximize is (equation 14) $$ \mathbb{E}_{s\sim\rho_{\theta_\text{old}},a\sim q}\left[\frac{\pi_\theta(a|s)}{q(a|s)} Q_{\theta_\text{old}}(s,a) \right] $$
which involves an expectation over states drawn according to $\rho$, the (unnormalized) discounted visitation frequency defined as $$ \rho_\pi(s) = P(s_0 = s)+\gamma P(s_1=s) + \gamma^2 P(s_2=s) + \dots $$
This seems to suggest that later timesteps should be sampled less often than earlier ones, or equivalently, that states be sampled uniformly along trajectories with an importance-sampling weight of $\gamma^t$.
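Spelling this out (this is just a restatement of the definition of $\rho$ above): for any function $f$ of the state, $$ \mathbb{E}_{s\sim\rho_\pi}\left[f(s)\right] \;\propto\; \sum_{t=0}^{\infty}\gamma^t\,\mathbb{E}\left[f(s_t)\right] \;=\; \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^t f(s_t)\right], $$ so each timestep $t$ within a trajectory should carry a weight $\gamma^t$ (up to the normalization constant $1/(1-\gamma)$).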
However, the usual implementations simply build batches from truncated or concatenated trajectories and weight every timestep equally, regardless of its position in the trajectory.
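For concreteness, here is a minimal sketch of the plain-ratio surrogate (PyTorch; the names `surrogate_loss` and `discount_weighting` are my own, not taken from any reference implementation), with the $\gamma^t$ weighting that the definition of $\rho$ seems to call for exposed as an option. Standard implementations correspond to `discount_weighting=False`:

```python
import torch

def surrogate_loss(logp_new, logp_old, adv, t, gamma=0.99, discount_weighting=False):
    """Plain-ratio surrogate E_t[ w_t * r_t(theta) * A_t ], returned as a loss to minimize.

    logp_new, logp_old: log pi_theta(a_t|s_t) and log pi_theta_old(a_t|s_t) for the sampled actions
    adv:                advantage estimates \hat A_t
    t:                  timestep index of each sample within its trajectory
    """
    ratio = torch.exp(logp_new - logp_old)      # r_t(theta)
    if discount_weighting:
        weights = gamma ** t.float()            # gamma^t, as the definition of rho seems to require
    else:
        weights = torch.ones_like(adv)          # what the usual implementations do
    return -(weights * ratio * adv).mean()

# Hypothetical usage on a batch of concatenated-trajectory samples:
# loss = surrogate_loss(logp_new, logp_old, adv, t, discount_weighting=True)
# loss.backward()
```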
The PPO paper does the same thing, writing the corresponding objective as (equation 3) $$ \mathbb{E}_t \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)} \hat A_t \right] $$
It seems that the $\gamma^t$ weighting gets lost in going from $\mathbb{E}_{s\sim \rho}$ to $\mathbb{E}_t$ in the discounted setting. Are the two objectives really equivalent?