I am looking at this formula, which breaks down the gradient of $P(\tau \mid \theta)$. The first part is clear, as is the derivative of $\log(x)$, but I do not see how the first formula is rearranged into the second.
1 Answer
The identity $$\nabla_{\theta} P(\tau \mid \theta) = P(\tau \mid \theta) \nabla_{\theta} \log P(\tau \mid \theta)\tag{1}\label{1},$$
which can also be written as
\begin{align} \nabla_{\theta} \log P(\tau \mid \theta) &= \frac{\nabla_{\theta} P(\tau \mid \theta)}{P(\tau \mid \theta)}\\ &=\frac{1}{P(\tau \mid \theta)} \nabla_{\theta} P(\tau \mid \theta) \end{align}
directly comes from the rule for differentiating the logarithm of a function, i.e. the chain rule \begin{align} \frac{d \log f(x)}{d x} &= \frac{1}{f(x)} \frac{d f}{dx}. \end{align} Note that $\log f(x)$ is a composite function, which is why we apply the chain rule, and that the derivative of $\log x$ is $\frac{1}{x}$, as your text says.
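A minimal numerical sanity check of identity \ref{1} (not from the answer; the toy function `P` below is just a made-up smooth, positive stand-in for $P(\tau \mid \theta)$):

```python
import numpy as np

# Toy stand-in for P(tau | theta): any smooth, strictly positive function works.
def P(theta):
    return np.exp(-theta ** 2) / 2.0

def num_grad(f, theta, eps=1e-6):
    # Central finite-difference approximation of df/dtheta.
    return (f(theta + eps) - f(theta - eps)) / (2.0 * eps)

theta = 0.7
lhs = num_grad(P, theta)                                  # grad_theta P(theta)
rhs = P(theta) * num_grad(lambda t: np.log(P(t)), theta)  # P(theta) * grad_theta log P(theta)
print(lhs, rhs)  # the two values agree up to finite-difference error
```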
People shouldn't call this a trick. There's no trick here. It's just basic calculus.
Why do you need identity \ref{1}? Because that identity tells you that the derivative of the probability of the trajectory given the parameter $\theta$, taken with respect to $\theta$, is $P(\tau \mid \theta)$ times the gradient of the logarithm of that same probability. How is this useful? Because the logarithm turns your product into a sum (and the derivative of a sum is the sum of the derivatives of its terms). Essentially, identity \ref{1} helps you compute the gradient in an easier way (at least, conceptually).
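For concreteness, here is what that product-to-sum step looks like under the standard factorization of the trajectory probability into an initial-state term, policy terms $p_\theta(a_t \mid s_t)$ and dynamics terms $p(s_{t+1} \mid s_t, a_t)$ (this factorization is not stated in the question; it is the usual one, also used in the last comment below): \begin{align} P(\tau \mid \theta) &= p(s_0) \prod_{t=0}^{T-1} p_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),\\ \log P(\tau \mid \theta) &= \log p(s_0) + \sum_{t=0}^{T-1} \big[ \log p_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t) \big]. \end{align}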

Thank you for your answer; it is clear how the derivative of the log fits into the equation, but not how the first form is rearranged to fit the second form. Are you able to elaborate on this? – Jacob B Apr 26 '20 at 23:30
@JacobB The first form is **not** rearranged to fit the second form in your example. The gradient of $P$ is given by your 2nd equation. Your 1st equation only tells you the definition of the probability of the trajectory. Then you can use your 2nd identity to find the gradient of the 1st. Why do you need the 2nd identity? Because there's a log there and your product in the 1st equation will become a sum. – nbro Apr 26 '20 at 23:44
I see, thanks. I misunderstood the transition between 1 & 2; it seemed they were building off the last equation. – Jacob B Apr 26 '20 at 23:51
In your follow-up you state we use the log of probabilities to make our computation easier, but we still have to compute the original probability, don't we? So we can't completely avoid a long chain of multiplications. – Jacob B Apr 26 '20 at 23:55
@JacobB Multiplications (especially of small numbers) can easily be numerically unstable. For example, if you multiply $0.5$ and $0.5$, you get $0.25$. If you keep multiplying small numbers, you can easily go to zero _with finite-precision numbers_. In your example, if $T$ is a big number, that can easily happen. However, summations do not suffer from this (i.e. $0.5+0.5 > 0.5$). That's another reason to use that trick. But, essentially, you use that trick because computing the gradient of a summation is easier than computing the gradient of a product. – nbro Apr 26 '20 at 23:59
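A minimal sketch of the underflow point made in the comment above (the numbers are made up for illustration):

```python
import numpy as np

# Multiplying many small probabilities underflows to 0.0 in double precision,
# whereas summing their logs stays an ordinary finite number.
probs = np.full(2000, 0.5)

product = np.prod(probs)             # 0.5 ** 2000 underflows to exactly 0.0
sum_of_logs = np.sum(np.log(probs))  # 2000 * log(0.5) is about -1386.29

print(product)      # 0.0
print(sum_of_logs)  # ~ -1386.29
```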
Right, but in the formula we don't avoid this unstable multiplication; only part of the formula is contained in a log. – Jacob B Apr 27 '20 at 00:14
The reason for using the log here is indeed to turn a product into a sum. But I don't believe it's for numerical reasons. Instead the reason is that the trajectory likelihood is a product of terms $p_\theta(a_t|s_t)p(s_{t+1}|a_t, s_t)$, with a policy $p_\theta(a_t|s_t)$ with parameters $\theta$. Once taking the log, we get a sum and the derivatives of the dynamics terms $p(s_{t+1}|a_t, s_t)$ with respect to the policy parameters $\theta$ drop out. – Chris Cundy Apr 30 '20 at 05:44
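To spell out the point in the last comment (same notation as in that comment): the dynamics terms do not depend on $\theta$, so their gradients vanish and only the policy terms survive: \begin{align} \nabla_{\theta} \log P(\tau \mid \theta) &= \nabla_{\theta} \log p(s_0) + \sum_{t} \nabla_{\theta} \log p_\theta(a_t \mid s_t) + \sum_{t} \nabla_{\theta} \log p(s_{t+1} \mid a_t, s_t)\\ &= \sum_{t} \nabla_{\theta} \log p_\theta(a_t \mid s_t). \end{align}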