Suppose that we are doing machine translation. We have a conditional language model with attention, where we are trying to predict a sequence $y_1, y_2, \dots, y_J$ from $x_1, x_2, \dots, x_I$: $$P(y_1, y_2, \dots, y_{J}|x_1, x_2, \dots, x_I) = \prod_{j=1}^{J} p(y_j|v_j, y_1, \dots, y_{j-1})$$ where $v_j$ is a context vector that is different for each $y_j$. Using an RNN with an encoder-decoder structure, each element $x_i$ of the input sequence and $y_j$ of the output sequence is converted into a hidden state $h_i$ and $s_j$, respectively: $$h_i = f(h_{i-1}, x_i) \\ s_j = g(s_{j-1},[y_{j-1}, v_j])$$ where $f$ is some function of the previous encoder state $h_{i-1}$ and the current input word $x_i$, and $g$ is some function of the previous decoder state $s_{j-1}$, the previous output word $y_{j-1}$ and the context vector $v_j$.
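For concreteness, here is a minimal NumPy sketch of these recurrences. The cell functions, dimensions, and parameter names are my own assumptions for illustration; $f$ and $g$ could just as well be GRU or LSTM cells:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (arbitrary, for illustration only)

# Toy parameters for the encoder cell f and decoder cell g
# (simple tanh RNN cells assumed here).
W_hh, W_hx = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_ss, W_sy = rng.normal(size=(d, d)), rng.normal(size=(d, 2 * d))

def f(h_prev, x_i):
    """Encoder step: h_i = f(h_{i-1}, x_i)."""
    return np.tanh(W_hh @ h_prev + W_hx @ x_i)

def g(s_prev, y_prev, v_j):
    """Decoder step: s_j = g(s_{j-1}, [y_{j-1}, v_j])."""
    return np.tanh(W_ss @ s_prev + W_sy @ np.concatenate([y_prev, v_j]))

# Encode a toy source sequence x_1, ..., x_I into states h_1, ..., h_I.
xs = [rng.normal(size=d) for _ in range(5)]
h = np.zeros(d)
encoder_states = []
for x_i in xs:
    h = f(h, x_i)
    encoder_states.append(h)

# One decoder step, with a dummy context vector for now;
# attention (below) is what actually produces v_j.
y_prev = rng.normal(size=d)
v_j = encoder_states[-1]   # placeholder until v_j is defined via attention
s_j = g(np.zeros(d), y_prev, v_j)
```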
Now, we want the process of predicting $s_j$ to "pay attention" to the correct parts of the encoder states via the context vector $v_j$. So: $$v_j = \sum_{i=1}^{I} \alpha_{ij} h_i$$ where $\alpha_{ij}$ tells us how much weight to put on the $i^{th}$ encoder state when predicting the $j^{th}$ word of the output sequence. Since we want the $\alpha_{ij}$s to be probabilities, we apply a softmax over the similarities between the encoder states and the decoder state: $$\alpha_{ij} = \frac{\exp(\text{sim}(h_i, s_{j-1}))}{\sum_{i'=1}^{I} \exp(\text{sim}(h_{i'}, s_{j-1}))}$$
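In code, the attention weights and context vector might look like the following sketch. The `sim` argument is a placeholder; a plain dot product is used here only to make the example runnable, and the additive scoring function defined next would replace it:

```python
import numpy as np

def attention_context(encoder_states, s_prev, sim):
    """v_j = sum_i alpha_{ij} h_i, with alpha_{ij} = softmax_i(sim(h_i, s_{j-1}))."""
    scores = np.array([sim(h_i, s_prev) for h_i in encoder_states])
    scores -= scores.max()                          # numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()  # softmax over source positions i
    v_j = sum(a * h_i for a, h_i in zip(alphas, encoder_states))
    return v_j, alphas

# Example with a dot-product similarity (placeholder choice).
rng = np.random.default_rng(1)
H = [rng.normal(size=8) for _ in range(5)]   # encoder states h_1, ..., h_I
s_prev = rng.normal(size=8)                  # previous decoder state s_{j-1}
v_j, alphas = attention_context(H, s_prev, sim=np.dot)
```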
Now, in additive attention, the similarities of the encoder and decoder states are computed as: $$\text{sim}(h_i, s_{j}) = \textbf{w}^{T} \text{tanh}(\textbf{W}_{h}h_{i} +\textbf{W}_{s}s_{j})$$
where $\textbf{w}$, $\textbf{W}_{h}$ and $\textbf{W}_{s}$ are learned attention parameters; in other words, the similarity is computed by a one-hidden-layer feed-forward network.
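A minimal sketch of this scoring function, with randomly initialised parameters standing in for the learned ones (the dimensions are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_att = 8, 16  # hidden size and attention hidden size (assumed)

# Attention parameters; in practice these are trained jointly with the network.
W_h = rng.normal(size=(d_att, d))
W_s = rng.normal(size=(d_att, d))
w = rng.normal(size=d_att)

def additive_sim(h_i, s_j):
    """sim(h_i, s_j) = w^T tanh(W_h h_i + W_s s_j): a one-hidden-layer MLP score."""
    return w @ np.tanh(W_h @ h_i + W_s @ s_j)

# Plugging the additive score into the softmax / context-vector computation above:
H = [rng.normal(size=d) for _ in range(5)]
s_prev = rng.normal(size=d)
scores = np.array([additive_sim(h_i, s_prev) for h_i in H])
alphas = np.exp(scores - scores.max())
alphas /= alphas.sum()
v_j = sum(a * h_i for a, h_i in zip(alphas, H))
```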
What is the intuition behind this definition? Why use the $\text{tanh}$ function? I know that the idea is to use one layer of a neural network to predict the similarities.
Added: this description of machine translation and attention is based on the Coursera course Natural Language Processing.