In the ICLR 2016 paper BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies, eq. 4 on page 3 reads:
$$ J_{ml}^s(\theta) = \log p_{\theta}(w_i | s) $$
They give the gradient in the subsequent eq. 5: $$ \frac{\partial J_{ml}^s(\theta)}{\partial \theta} = \frac{\partial}{\partial \theta}\langle\theta_i, s\rangle - \sum_{j=1}^V p_{\theta}(w_j|s)\,\frac{\partial}{\partial \theta} \langle\theta_j, s\rangle$$
I am not able to understand how they have obtained this - I have tried to work it out as follows:
From eq. 3 we have
$$ p_{\theta}(w_i|s) = \frac{\exp(\langle\theta_i, s\rangle)}{\sum_{j=1}^V \exp(\langle\theta_j, s\rangle)} $$
Rewriting eq. 4, we have:
$$\begin{eqnarray} J_{ml}^s(\theta) &=& \log \frac{\exp(\langle\theta_i, s\rangle)}{\sum_{j=1}^V \exp(\langle\theta_j, s\rangle)} \nonumber \\ &=& \log \exp(\langle\theta_i, s\rangle) - \log \sum_{j=1}^V \exp(\langle\theta_j, s\rangle) \nonumber \end{eqnarray}$$
Now, taking derivatives w.r.t. $ \theta $:
$$\begin{eqnarray} \frac{\partial}{\partial \theta} J_{ml}^s(\theta) &=& \frac{\partial}{\partial \theta} \log \exp(\langle\theta_i, s\rangle) - \frac{\partial}{\partial \theta} \log \sum_{j=1}^V \exp(\langle\theta_j, s\rangle) \nonumber \end{eqnarray}$$
That is as far as I get. How does the second term (after the minus sign) become the term $\sum_{j=1}^V p_{\theta}(w_j|s)\frac{\partial}{\partial \theta} \langle\theta_j, s\rangle$ given in eq. 5? Or did I commit a blunder?
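For what it's worth, here is a quick numerical sanity check (a minimal sketch using NumPy; the toy sizes $V=7$, $d=5$ and the variable names are my own choices, not from the paper). It compares the gradient formula of eq. 5 against a finite-difference gradient of eq. 4, and the two agree, so eq. 5 itself seems right and the gap must be in my algebra.

```python
import numpy as np

# Toy sizes, chosen only for this sanity check: vocabulary V, context dimension d.
V, d = 7, 5
rng = np.random.default_rng(0)
theta = rng.normal(size=(V, d))   # output vectors theta_j, one row per word
s = rng.normal(size=d)            # context vector s
i = 2                             # index of the target word w_i

def J_ml(theta, s, i):
    """Eq. 4: log p_theta(w_i | s), with p_theta the softmax of eq. 3."""
    logits = theta @ s
    return logits[i] - np.log(np.sum(np.exp(logits)))

# Gradient as written in eq. 5, taken w.r.t. the rows of theta:
# d/dtheta <theta_i, s>  -  sum_j p_theta(w_j|s) d/dtheta <theta_j, s>.
p = np.exp(theta @ s)
p /= p.sum()
grad_eq5 = np.zeros_like(theta)
grad_eq5[i] += s                   # first term: only row i depends on theta_i
grad_eq5 -= p[:, None] * s[None]   # second term: row j weighted by p_theta(w_j|s)

# Central finite differences on eq. 4 for comparison.
eps = 1e-6
grad_fd = np.zeros_like(theta)
for j in range(V):
    for k in range(d):
        tp, tm = theta.copy(), theta.copy()
        tp[j, k] += eps
        tm[j, k] -= eps
        grad_fd[j, k] = (J_ml(tp, s, i) - J_ml(tm, s, i)) / (2 * eps)

print(np.max(np.abs(grad_eq5 - grad_fd)))  # on the order of 1e-9: the two gradients agree
```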
Update
I did commit a blunder and I have edited it out, but the question remains!
The correct property is: $$\log \Big(\prod_{i=1}^K x_i\Big) = \sum_{i=1}^K \log x_i$$
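A concrete instance of the property (the product splits into a sum of logs; a sum inside the log does not):
$$\log(2 \cdot 3) = \log 2 + \log 3, \qquad \text{whereas} \qquad \log(2 + 3) \neq \log 2 + \log 3.$$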