In the ICLR 2016 paper BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies, eq. 4 on page 3 reads:
$$ J_{ml}^s(\theta) = \log p_{\theta}(w_i | s) $$
They give the gradient in the subsequent eq. 5: $$ \frac{\partial J_{ml}^s(\theta)}{\partial \theta} = \frac{\partial}{\partial \theta}\langle\theta_i, s\rangle - \sum_{j=1}^V p_{\theta}(w_j|s)\,\frac{\partial}{\partial \theta} \langle\theta_j, s\rangle$$
I am not able to understand how they have obtained this - I have tried to work it out as follows:
From eq. 3 we have
$$ p_{\theta}(w_i|s) = \frac{\exp(\langle\theta_i, s\rangle)}{\sum_{j=1}^V \exp(\langle\theta_j, s\rangle)} $$
Rewriting eq. 4, we have:
$$\begin{eqnarray} J_{ml}^s(\theta) &=& \log \frac{\exp(\langle\theta_i, s\rangle)}{\sum_{j=1}^V \exp(\langle\theta_j, s\rangle)} \nonumber \\ &=& \log \exp(\langle\theta_i, s\rangle) - \log \sum_{j=1}^V \exp(\langle\theta_j, s\rangle) \nonumber \end{eqnarray}$$
Now, taking derivatives w.r.t. $ \theta $:
$$\begin{eqnarray} \frac{\partial}{\partial \theta} J_{ml}^s(\theta) &=& \frac{\partial}{\partial \theta} \log \exp(\langle\theta_i, s\rangle) - \frac{\partial}{\partial \theta} \log \sum_{j=1}^V \exp(\langle\theta_j, s\rangle) \nonumber \end{eqnarray}$$
That is as far as I get. How does the second term (after the minus sign) become the term $\sum_{j=1}^V p_{\theta}(w_j|s)\frac{\partial}{\partial \theta} \langle\theta_j, s\rangle$ given in eq. 5? Or did I commit a blunder?
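For what it's worth, here is a quick numerical sanity check (a minimal sketch using NumPy; the toy sizes $V=7$, $d=5$ and the variable names are my own choices, not from the paper). It compares the gradient formula of eq. 5 against a finite-difference gradient of eq. 4, and the two agree, so eq. 5 itself seems right and the gap must be in my algebra.

```python
import numpy as np

# Toy sizes, chosen only for this sanity check: vocabulary V, context dimension d.
V, d = 7, 5
rng = np.random.default_rng(0)
theta = rng.normal(size=(V, d))   # output vectors theta_j, one row per word
s = rng.normal(size=d)            # context vector s
i = 2                             # index of the target word w_i

def J_ml(theta, s, i):
    """Eq. 4: log p_theta(w_i | s), with p_theta the softmax of eq. 3."""
    logits = theta @ s
    return logits[i] - np.log(np.sum(np.exp(logits)))

# Gradient as written in eq. 5, taken w.r.t. the rows of theta:
# d/dtheta <theta_i, s>  -  sum_j p_theta(w_j|s) d/dtheta <theta_j, s>.
p = np.exp(theta @ s)
p /= p.sum()
grad_eq5 = np.zeros_like(theta)
grad_eq5[i] += s                   # first term: only row i depends on theta_i
grad_eq5 -= p[:, None] * s[None]   # second term: row j weighted by p_theta(w_j|s)

# Central finite differences on eq. 4 for comparison.
eps = 1e-6
grad_fd = np.zeros_like(theta)
for j in range(V):
    for k in range(d):
        tp, tm = theta.copy(), theta.copy()
        tp[j, k] += eps
        tm[j, k] -= eps
        grad_fd[j, k] = (J_ml(tp, s, i) - J_ml(tm, s, i)) / (2 * eps)

print(np.max(np.abs(grad_eq5 - grad_fd)))  # on the order of 1e-9: the two gradients agree
```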
Update
I did commit a blunder and I have edited it out, but the question remains!
The correct property is: $$\log \Big(\prod_{i=1}^K x_i\Big) = \sum_{i=1}^K \log x_i$$
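A concrete instance of the property (the product splits into a sum of logs; a sum inside the log does not):
$$\log(2 \cdot 3) = \log 2 + \log 3, \qquad \text{whereas} \qquad \log(2 + 3) \neq \log 2 + \log 3.$$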