
I am currently implementing the most basic version of the Monte Carlo policy gradient algorithm (REINFORCE), and I was wondering whether this is the correct gradient of the log of the softmax policy:

\begin{align} \nabla_{\theta} \log \pi_{\theta}(s, a) &= \varphi(s, a) - \mathbb{E}_{a' \in A}\left[\varphi(s, a')\right] \\ &= \left(\varphi(s)^{T} \theta_{a}\right) - \sum_{a' \in A}\left(\varphi(s)^{T} \theta_{a'}\right) \end{align}

where $\varphi(s)$ is the feature vector at state $s$.

I am not sure if my interpretation of the equation is correct. I ask because, in my implementation, the weights ($\theta$) blow up after a few iterations, and I have a feeling the problem is in this step.
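
For reference, here is a minimal NumPy sketch of how the two terms in the equation above are usually computed for a linear softmax policy with per-action weight vectors $\theta_a$ and a shared feature vector $\varphi(s)$ (the function name, array shapes, and the policy-weighted expectation are my assumptions, not something from my actual implementation):

```python
import numpy as np

def log_softmax_grad(phi_s, theta, a):
    """Gradient of log pi(a|s) for a linear softmax policy.

    Assumes preferences h(s, b) = phi(s)^T theta[b], with one weight
    row theta[b] per action b and a shared state feature vector phi(s).

    phi_s : (d,)              feature vector phi(s)
    theta : (num_actions, d)  weight matrix, one row per action
    a     : int               index of the action actually taken
    """
    prefs = theta @ phi_s                     # h(s, b) for every action b
    prefs -= prefs.max()                      # subtract max for numerical stability
    pi = np.exp(prefs) / np.exp(prefs).sum()  # softmax probabilities pi(b|s)

    # d/d theta[b] log pi(a|s) = (1{b == a} - pi(b|s)) * phi(s)
    grad = -np.outer(pi, phi_s)               # expectation term, weighted by pi(b|s)
    grad[a] += phi_s                          # feature term for the action taken
    return grad                               # same shape as theta
```

Note that in this standard form the expectation term weights each $\varphi(s)$ by $\pi_{\theta}(b \mid s)$ rather than summing the raw preferences, and subtracting the maximum preference before exponentiating keeps the softmax numerically stable.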
