Firstly, I am assuming $r$ is the update to the hidden state. I will call it $\tilde{\mathbf{h}}_t$, because this is how it's called in the paper by Cho et al. that introduced the GRU (where $\mathbf{r}_t$ is actually a different vector).
Both formulas do exactly the same thing. The only difference is how you would interpret the role of the vector $\mathbf{z}_t \in (0, 1)^d$. Importantly, because $\mathbf{z}_t$ is sigmoid-activated, its values are between $0$ and $1$. Thus, you could say that $\mathbf{z}_t$ acts like a continuous switch that controls how much of $\mathbf{h}_{t-1}$ and $\tilde{\mathbf{h}}_t$ ends up in the updated hidden state $\mathbf{h}_t$. This is also called a gate.
In the first formula $\mathbf{h}_t = \mathbf{h}_{t-1}\mathbf{z}_t+\tilde{\mathbf{h}}_t(1-\mathbf{z}_t)$
you could say that $\mathbf{z}_t$ acts like a keep-gate that tells you how much of the old information in the hidden state is preserved. Its complement, $1 - \mathbf{z}_t$, tells you how much the new information $\tilde{\mathbf{h}}_t$ contributes.
In the second formula: $\mathbf{h}_t = \mathbf{h}_{t-1}(1 - \mathbf{z}_t)+\tilde{\mathbf{h}}_t\mathbf{z}_t$
the role of $\mathbf{z}_t$ is inverted: it can now be interpreted as an update-gate, controlling how much of the old information is replaced by the update $\tilde{\mathbf{h}}_t$.
From an optimization perspective, it doesn't matter whether the network learns $\mathbf{z}_t$ or its complement $1 - \mathbf{z}_t$ (i.e. the keep or the update version). Therefore, both equations are valid and work equally well.
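To see the equivalence concretely, here is a minimal numpy sketch (with made-up random vectors, not a trained GRU) showing that the first formula with gate $\mathbf{z}_t$ produces exactly the same state as the second formula fed with the complement $1 - \mathbf{z}_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
h_prev = rng.standard_normal(4)                     # previous hidden state h_{t-1}
h_tilde = rng.standard_normal(4)                    # candidate update \tilde{h}_t
z = 1.0 / (1.0 + np.exp(-rng.standard_normal(4)))   # sigmoid gate, values in (0, 1)

# First formula: z acts as a keep-gate on h_{t-1}
h_keep = h_prev * z + h_tilde * (1 - z)

# Second formula: z acts as an update-gate on \tilde{h}_t
h_update = h_prev * (1 - z) + h_tilde * z

# Substituting the complement 1 - z into the second formula
# recovers the first, so the two only differ by a relabeling of the gate.
print(np.allclose(h_keep, h_prev * (1 - (1 - z)) + h_tilde * (1 - z)))  # True
```

Since the gate is produced by a learned sigmoid layer, the network can realize either convention simply by flipping the sign of the gate's pre-activation weights.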
The first formula is actually the one used in the original paper by Cho et al., where $\mathbf{z}_t$ is called an update gate (I would argue that $\mathbf{z}_t$, in this case, acts more like a keep-gate, but since its complement controls the portion of the update, that name is also defensible).