In the Trust Region Policy Optimization (TRPO) paper, on page 10, it is stated:

> An informal overview is as follows. Our proof relies on the notion of coupling, where we jointly define the policies $\pi$ and $\hat\pi$ so that they choose the same action with high probability $(1-\alpha)$. Surrogate loss $L_\pi(\hat\pi)$ accounts for the advantage of $\hat\pi$ the first time that it disagrees with $\pi$, but not subsequent disagreements. Hence, the error in $L_\pi$ is due to two or more disagreements between $\pi$ and $\hat\pi$; hence, we get an $O(\alpha^2)$ correction term, where $\alpha$ is the probability of disagreement.
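For reference, the two quantities being compared are defined in the paper as

$$\eta(\hat\pi) = \eta(\pi) + \mathbb{E}_{\tau \sim \hat\pi}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right], \qquad L_\pi(\hat\pi) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \hat\pi(a \mid s)\, A_\pi(s, a),$$

where $\rho_\pi$ is the discounted state-visitation frequency under the *old* policy $\pi$, so the only difference is which policy generates the states at which the advantages are evaluated.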
I don't see how this holds. In what way does $L_\pi$ account for the first disagreement? Surely once $\hat\pi$ disagrees with $\pi$, the trajectory distributions differ in expectation, so $L_\pi$ should immediately differ from $\eta(\hat\pi)$?
I understand the proof given, but I wanted to try to capture this intuition.
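To convince myself the $O(\alpha^2)$ claim at least holds numerically, here is a minimal sketch (my own toy setup, not from the paper): a small random MDP where $\hat\pi = (1-\alpha)\pi + \alpha\,\pi_{\text{dev}}$, so the policies can be coupled to disagree with probability at most $\alpha$ at each state. Both $\eta(\hat\pi)$ and $L_\pi(\hat\pi)$ are computed exactly via linear solves:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Hypothetical toy MDP: P[s, a, s'] transitions, r[s, a] rewards,
# mu the start-state distribution.
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))
mu = np.ones(nS) / nS

def analyze(pi):
    """Exact eta(pi), advantage A_pi, and (unnormalized) discounted visitation rho_pi."""
    P_pi = np.einsum('sa,sap->sp', pi, P)          # state->state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, r)            # expected one-step reward under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V                          # Q_pi(s, a)
    A = Q - V[:, None]                             # advantage A_pi(s, a)
    rho = np.linalg.solve((np.eye(nS) - gamma * P_pi).T, mu)  # mu^T (I - gamma P_pi)^{-1}
    return mu @ V, A, rho

pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)
dev = rng.random((nS, nA)); dev /= dev.sum(axis=1, keepdims=True)
eta_pi, A_pi, rho_pi = analyze(pi)

for alpha in [0.2, 0.1, 0.05, 0.025]:
    pi_hat = (1 - alpha) * pi + alpha * dev        # disagrees w.p. <= alpha per state
    eta_hat, _, _ = analyze(pi_hat)
    # Surrogate L_pi(pi_hat) = eta(pi) + sum_s rho_pi(s) sum_a pi_hat(a|s) A_pi(s, a)
    L = eta_pi + rho_pi @ np.einsum('sa,sa->s', pi_hat, A_pi)
    gap = abs(eta_hat - L)
    print(f"alpha={alpha:5.3f}  |eta - L| = {gap:.2e}  gap/alpha^2 = {gap / alpha**2:.3f}")
```

The printed ratio $|\eta(\hat\pi) - L_\pi(\hat\pi)|/\alpha^2$ stays roughly constant as $\alpha$ halves, consistent with the claimed $O(\alpha^2)$ error. So the claim checks out; what I am missing is the intuition for why the first disagreement is already accounted for by $L_\pi$ and only the later ones are not.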