
To derive the policy gradient, we start by writing the equation for the probability of a trajectory (see, e.g., the Spinning Up tutorial):

$$ \begin{align} P_\theta(\tau) &= P_\theta(s_0, a_0, s_1, a_1, \dots, s_T, a_T) \\ & = p(s_0) \prod_{i=0}^T \pi_\theta(a_i | s_i) p(s_{i+1} | s_i, a_i) \end{align} $$
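
For concreteness, here is a minimal sketch of what this factorization computes for a single trajectory. It is only an illustration: it uses NumPy and a made-up two-state, two-action MDP, so `p0`, `P`, `pi` and all the numbers are placeholders, not anything from the lecture or tutorial.

```python
import numpy as np

# Made-up two-state, two-action MDP (all values are placeholders).
rng = np.random.default_rng(0)
p0 = np.array([0.7, 0.3])                    # p(s_0)
P = rng.dirichlet(np.ones(2), size=(2, 2))   # P[s, a, s'] = p(s' | s, a)
pi = rng.dirichlet(np.ones(2), size=2)       # pi[s, a]    = pi_theta(a | s)

def traj_prob(states, actions):
    """p(s_0) * prod_i pi_theta(a_i | s_i) * p(s_{i+1} | s_i, a_i),
    with the final transition omitted when the trajectory ends at a_T."""
    prob = p0[states[0]]
    for i, a in enumerate(actions):
        prob *= pi[states[i], a]
        if i + 1 < len(states):
            prob *= P[states[i], a, states[i + 1]]
    return prob

print(traj_prob(states=[0, 1, 0], actions=[1, 0, 1]))
```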

The expression is based on the chain rule for probability. My understanding is that applying the chain rule directly should give this expression:

$$ p(s_0)\prod_{i=0}^T \pi_\theta(a_i|s_i, a_{i-1}, s_{i-1}, a_{i-2}, \dots, a_0, s_0) \, p(s_{i+1} | s_i, a_i, s_{i-1}, a_{i-1}, \dots, a_0, s_0) $$

Applying the Markov property then produces the desired equality, since each conditional depends only on the most recent state-action pair.
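
To convince myself of this reduction, here is a small consistency check in the same spirit (again NumPy, a made-up two-state MDP with placeholder numbers): if the joint over $(s_0, a_0, s_1, a_1, s_2)$ is generated by an MDP, then the history-conditioned transition probability given by the chain rule collapses to the one-step Markov factor for every history.

```python
import itertools
import numpy as np

# Made-up two-state, two-action MDP (all values are placeholders).
rng = np.random.default_rng(0)
p0 = np.array([0.7, 0.3])
P = rng.dirichlet(np.ones(2), size=(2, 2))   # P[s, a, s'] = p(s' | s, a)
pi = rng.dirichlet(np.ones(2), size=2)       # pi[s, a]    = pi_theta(a | s)

# Joint over (s0, a0, s1, a1, s2) generated by the MDP.
joint = np.zeros((2, 2, 2, 2, 2))
for s0, a0, s1, a1, s2 in itertools.product(range(2), repeat=5):
    joint[s0, a0, s1, a1, s2] = (p0[s0] * pi[s0, a0] * P[s0, a0, s1]
                                 * pi[s1, a1] * P[s1, a1, s2])

# Chain rule: p(s2 | s1, a1, s0, a0) obtained from the joint by normalization...
for s0, a0, s1, a1 in itertools.product(range(2), repeat=4):
    hist = joint[s0, a0, s1, a1, :]
    cond = hist / hist.sum()
    # ...equals the one-step factor p(s2 | s1, a1): the Markov property.
    assert np.allclose(cond, P[s1, a1])
print("history-conditioned transition == one-step Markov transition for every history")
```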

Here are my questions:

  1. Is this reasoning correct?

  2. I watched this lecture about policy gradients, and at this point in the lecture Sergey Levine says: "at no point did we use the Markov property when we derived the policy gradient". This left me confused, since I assumed that the initial step of writing down the trajectory probability already uses the Markov property. (The worked step below shows the part of the derivation I mean.)
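
Concretely, as I understand the standard derivation, the log of the trajectory probability splits into a sum, and the initial-state and transition terms drop out of the gradient because they do not depend on $\theta$:

$$ \nabla_\theta \log P_\theta(\tau) = \nabla_\theta \left[ \log p(s_0) + \sum_{i=0}^T \log \pi_\theta(a_i | s_i) + \sum_{i=0}^T \log p(s_{i+1} | s_i, a_i) \right] = \sum_{i=0}^T \nabla_\theta \log \pi_\theta(a_i | s_i) $$

The dynamics terms would vanish from this gradient whether they were conditioned on the full history or only on $(s_i, a_i)$, which might be what the remark in the lecture is getting at.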

  • In the first equation you show, you definitely are using the Markov property. When deriving the policy gradient you don’t _explicitly_ use the Markov property; this is clear if you refer to the Sutton and Barto derivation. However, I am still of the opinion that the Markov property is used, as the underlying assumption of the MDP is that the Markov property holds (e.g. our policy is conditioned only on the current state, not the whole trajectory). – David Dec 21 '20 at 00:27
  • I watched the videos you’re watching, and I would argue that the way he derives the policy gradient definitely does use the Markov property: he directly uses the trajectory probability you have in your question, and the LHS equals the RHS only if you assume the Markov property holds. Otherwise, as you say, you would end up with a term like the second equation you have written. – David Dec 21 '20 at 00:58
  • I still have to read the policy gradient chapter in Sutton & Barto; maybe comparing the two derivations can clear it up. I wonder if conditioning on the policy is the answer, so it should be $P(\tau | \pi)$. I just realized that’s how it’s written in the Spinning Up docs. – Gerges Dec 21 '20 at 01:19
  • I don’t know what they mean by ‘conditioning on a policy’; that is analogous to conditioning on a density function. – David Dec 21 '20 at 01:23
