In the Trust Region Policy Optimization (TRPO) paper, on page 10, it is stated:

> An informal overview is as follows. Our proof relies on the notion of coupling, where we jointly define the policies $\pi$ and $\hat\pi$ so that they choose the same action with high probability $(1-\alpha)$. Surrogate loss $L_\pi(\hat\pi)$ accounts for the advantage of $\hat\pi$ the first time that it disagrees with $\pi$, but not subsequent disagreements. Hence, the error in $L_\pi$ is due to two or more disagreements between $\pi$ and $\hat\pi$; hence, we get an $O(\alpha^2)$ correction term, where $\alpha$ is the probability of disagreement.
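For reference, the two quantities being compared are defined in the paper as

$$\eta(\hat\pi) = \eta(\pi) + \mathbb{E}_{\tau \sim \hat\pi}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right], \qquad L_\pi(\hat\pi) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \hat\pi(a \mid s)\, A_\pi(s, a),$$

where $\rho_\pi$ is the discounted state-visitation frequency under the *old* policy $\pi$, so the only difference is which policy generates the states at which the advantages are evaluated.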
I don't see how this holds. In what way does $L_\pi$ account for the first disagreement? Surely once $\hat\pi$ disagrees with $\pi$, the trajectory distributions differ in expectation, so $L_\pi$ should immediately differ from $\eta(\hat\pi)$?
I understand the proof given, but I wanted to try to capture this intuition.
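To convince myself the $O(\alpha^2)$ claim at least holds numerically, here is a minimal sketch (my own toy setup, not from the paper): a small random MDP where $\hat\pi = (1-\alpha)\pi + \alpha\,\pi_{\text{dev}}$, so the policies can be coupled to disagree with probability at most $\alpha$ at each state. Both $\eta(\hat\pi)$ and $L_\pi(\hat\pi)$ are computed exactly via linear solves:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Hypothetical toy MDP: P[s, a, s'] transitions, r[s, a] rewards,
# mu the start-state distribution.
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))
mu = np.ones(nS) / nS

def analyze(pi):
    """Exact eta(pi), advantage A_pi, and (unnormalized) discounted visitation rho_pi."""
    P_pi = np.einsum('sa,sap->sp', pi, P)          # state->state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, r)            # expected one-step reward under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V                          # Q_pi(s, a)
    A = Q - V[:, None]                             # advantage A_pi(s, a)
    rho = np.linalg.solve((np.eye(nS) - gamma * P_pi).T, mu)  # mu^T (I - gamma P_pi)^{-1}
    return mu @ V, A, rho

pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)
dev = rng.random((nS, nA)); dev /= dev.sum(axis=1, keepdims=True)
eta_pi, A_pi, rho_pi = analyze(pi)

for alpha in [0.2, 0.1, 0.05, 0.025]:
    pi_hat = (1 - alpha) * pi + alpha * dev        # disagrees w.p. <= alpha per state
    eta_hat, _, _ = analyze(pi_hat)
    # Surrogate L_pi(pi_hat) = eta(pi) + sum_s rho_pi(s) sum_a pi_hat(a|s) A_pi(s, a)
    L = eta_pi + rho_pi @ np.einsum('sa,sa->s', pi_hat, A_pi)
    gap = abs(eta_hat - L)
    print(f"alpha={alpha:5.3f}  |eta - L| = {gap:.2e}  gap/alpha^2 = {gap / alpha**2:.3f}")
```

The printed ratio $|\eta(\hat\pi) - L_\pi(\hat\pi)|/\alpha^2$ stays roughly constant as $\alpha$ halves, consistent with the claimed $O(\alpha^2)$ error. So the claim checks out; what I am missing is the intuition for why the first disagreement is already accounted for by $L_\pi$ and only the later ones are not.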