
I happened to discover that the v1 (19 Feb 2015) and the v5 (20 Apr 2017) versions of the TRPO paper state the optimization problem differently. Equation (15) in v1 uses $\min_\theta$, while Equation (14) in v5 uses $\max_\theta$. So I'm a little bit confused about which one to choose.

BTW, I found that Equation (31) in the paper High-Dimensional Continuous Control Using Generalized Advantage Estimation uses $\min_\theta$.

fish_tree

1 Answer


The difference you have observed between the two versions of the TRPO paper is due to different formalizations of the problem and the objective.

In the first version of the paper you linked, they start out in Section 2 by defining Markov Decision Processes (MDPs) as tuples that, among other things, have a cost function $c : \mathcal{S} \rightarrow \mathbb{R}$. They define $\eta(\pi)$ as the expected discounted cost of a policy $\pi$, and subsequently also define state-action value functions $Q_{\pi}(s_t, a_t)$, value functions $V_{\pi}(s_t)$, and advantage functions $A_{\pi}(s, a)$ in terms of costs. Ultimately, in Equation 15, they write the following:

$$\begin{aligned} \underset{\theta}{\text{minimize }} & \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, a \sim q} \left[ \frac{\pi_{\theta}(a \vert s)}{q(a \vert s)} Q_{\theta_{\text{old}}}(s, a) \right] \\ \text{subject to } & \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}} \left[ D_{KL}\left(\pi_{\theta_{\text{old}}}(\cdot \vert s) ~ \Vert ~ \pi_{\theta}(\cdot \vert s)\right) \right] \leq \delta \end{aligned}$$

Now, there's a lot going on there, but we can very informally "simplify" it to only the parts that are relevant for this question, as follows:

$$\underset{\theta}{\text{minimize }} \mathbb{E} \left[ Q(s, a) \right]$$

When we look at just that, we see that we're essentially trying to minimize $Q$-values, which here are costs; that makes sense, since costs are typically things we want to minimize.
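
To make the pieces of Equation 15 concrete, here is a minimal NumPy sketch. Everything in it (the discrete action space, the randomly generated arrays, the sample-based averaging) is a hypothetical illustration, not anything taken from the paper; it only shows what the importance-sampled surrogate cost and the mean-KL constraint evaluate to on a batch of sampled states and actions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: N sampled states, a discrete action space of size A.
N, A = 1000, 4
pi_old = rng.dirichlet(np.ones(A), size=N)   # sampling policy q = pi_theta_old(.|s)
pi_new = rng.dirichlet(np.ones(A), size=N)   # candidate policy pi_theta(.|s)
q_cost = rng.normal(size=(N, A))             # Q_theta_old(s, a), cost-based as in v1

# One action per state, sampled from the old policy (a ~ q).
actions = np.array([rng.choice(A, p=p) for p in pi_old])
idx = np.arange(N)

# Surrogate: E_{s ~ rho_old, a ~ q}[ pi_theta(a|s) / q(a|s) * Q_theta_old(s, a) ]
ratio = pi_new[idx, actions] / pi_old[idx, actions]
surrogate = np.mean(ratio * q_cost[idx, actions])

# Constraint: mean KL( pi_theta_old(.|s) || pi_theta(.|s) ) <= delta
mean_kl = np.mean(np.sum(pi_old * np.log(pi_old / pi_new), axis=1))

delta = 0.01
print(f"surrogate cost: {surrogate:.4f}  mean KL: {mean_kl:.4f}  feasible: {mean_kl <= delta}")
```

In the cost-based (v1) formulation, the optimization then looks for the $\theta$ that makes `surrogate` as small as possible while keeping `mean_kl` below $\delta$.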


In the second version of the paper you linked, they have changed the Preliminaries in Section 2. Now they no longer have a cost function $c$ in their definition of an MDP; they have replaced it by a reward function $r : \mathcal{S} \rightarrow \mathbb{R}$. Then they move on to define $\eta(\pi)$ as the expected discounted reward (rather than expected discounted cost), and also define $Q$, $V$ and $A$ in terms of rewards rather than costs. This now all matches the standard, common terminology in Reinforcement Learning.
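
Written out, those reward-based quantities are (up to the exact notation used in the paper) the standard discounted-return definitions:

$$\begin{aligned} \eta(\pi) &= \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t) \right], \\ Q_{\pi}(s_t, a_t) &= \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\left[ \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}) \right], \\ V_{\pi}(s_t) &= \mathbb{E}_{a_t, s_{t+1}, \dots}\left[ \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}) \right], \\ A_{\pi}(s, a) &= Q_{\pi}(s, a) - V_{\pi}(s), \end{aligned}$$

and the first (v1) version is obtained by using the cost $c$ in place of the reward $r$ everywhere.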

Ultimately, Equation 14 looks identical to what we saw above: it is again an expectation of $Q$-values. But now those $Q$-values are defined in terms of rewards rather than costs. Rewards are generally things we want to maximize rather than minimize, which is why the objective flipped from minimization to maximization.
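
If you take the reward to be the negated cost, $r = -c$ (the usual way to convert between the two conventions), the two surrogates are exact negatives of each other, so both versions describe the same optimization problem. A tiny illustrative sketch of that equivalence (all arrays below are made up for demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sampled quantities for a batch of (s, a) pairs.
ratio = rng.uniform(0.5, 1.5, size=1000)      # pi_theta(a|s) / q(a|s)
q_cost = rng.normal(size=1000)                # cost-based Q_theta_old(s, a)    (v1)
q_reward = -q_cost                            # reward-based Q, taking r = -c   (v5)

surrogate_cost = np.mean(ratio * q_cost)      # v1: minimize this
surrogate_reward = np.mean(ratio * q_reward)  # v5: maximize this

# The two surrogates are exact negatives of each other, so the theta that
# minimizes the cost version is the same theta that maximizes the reward version.
assert np.isclose(surrogate_cost, -surrogate_reward)
```

This is also why gradient-based implementations of the reward-based formulation often just minimize the negated surrogate with a standard descent optimizer.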

Dennis Soemers