To derive the policy gradient, we start by writing down the probability of a given trajectory (e.g., see the spinningup tutorial):
$$ \begin{align} P_\theta(\tau) &= P_\theta(s_0, a_0, s_1, a_1, \dots, s_T, a_T) \\ & = p(s_0) \prod_{i=0}^T \pi_\theta(a_i | s_i) p(s_{i+1} | s_i, a_i) \end{align} $$
This expression is based on the chain rule of probability. My understanding is that applying the chain rule should first yield this expression:
$$ p(s_0)\prod_{i=0}^T \pi_\theta(a_i \mid s_i, a_{i-1}, s_{i-1}, \dots, a_0, s_0)\, p(s_{i+1} \mid s_i, a_i, s_{i-1}, a_{i-1}, \dots, a_0, s_0) $$
Then the Markov property should be applicable, producing the desired equality: each factor ends up depending only on the most recent state (for the policy) or the most recent state-action pair (for the dynamics).
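To make the step I have in mind explicit, here is my own sketch of the reduction (using the same notation as above; this is not quoted from any source):
$$ \begin{align} P_\theta(\tau) &= p(s_0) \prod_{i=0}^T \pi_\theta(a_i \mid s_0, a_0, \dots, s_i)\, p(s_{i+1} \mid s_0, a_0, \dots, s_i, a_i) \\ &= p(s_0) \prod_{i=0}^T \pi_\theta(a_i \mid s_i)\, p(s_{i+1} \mid s_i, a_i), \end{align} $$
where the second line uses $p(s_{i+1} \mid s_0, a_0, \dots, s_i, a_i) = p(s_{i+1} \mid s_i, a_i)$ (the Markov assumption on the dynamics) and $\pi_\theta(a_i \mid s_0, a_0, \dots, s_i) = \pi_\theta(a_i \mid s_i)$ (the policy conditioning only on the current state).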
Here are my questions:
Is this reasoning correct?
I watched this lecture about policy gradients, and at this point in the lecture Sergey says: "at no point did we use the Markov property when we derived the policy gradient", which left me confused. I assumed that the initial step of writing down the trajectory probability was already using the Markov property.
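For context, the derivation I am referring to is the standard log-derivative one (e.g., as in the spinningup tutorial), where after taking the log of $P_\theta(\tau)$ the dynamics terms drop out of the gradient because they do not depend on $\theta$:
$$ \begin{align} \nabla_\theta \log P_\theta(\tau) &= \nabla_\theta \left[ \log p(s_0) + \sum_{i=0}^T \log \pi_\theta(a_i \mid s_i) + \sum_{i=0}^T \log p(s_{i+1} \mid s_i, a_i) \right] \\ &= \sum_{i=0}^T \nabla_\theta \log \pi_\theta(a_i \mid s_i). \end{align} $$
So my confusion is specifically about whether writing $P_\theta(\tau)$ in the factored form at the top already relied on the Markov property, even though that property is not used again later in the derivation.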