
For neural machine translation, there's this model "Seq2Seq with attention", also known as the "Bahdanau architecture" (a good image can be found on this page), where instead of Seq2Seq's encoder LSTM passing a single hidden vector $\vec h[T]$ to the decoder LSTM, the encoder makes all of its hidden vectors $\vec h[1] \dots \vec h[T]$ available and the decoder computes weights $\alpha_i[t]$ with each iteration -- by comparing the decoder's previous hidden state $\vec s[t-1]$ to each encoder hidden state $\vec h[i]$ -- to decide which of those hidden vectors are the most valuable. These are then added together to get a single "context vector" $\vec c[t] = \alpha_1[t]\,\vec h[1] + \alpha_2[t]\,\vec h[2]+\dots +\alpha_T[t]\,\vec h[T]$, which supposedly functions as Seq2Seq's single hidden vector.
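To be clear, the part I *do* understand can be sketched in a few lines (a minimal numpy sketch with made-up shapes; the dot-product score below stands in for Bahdanau's actual MLP-based scorer):

```python
import numpy as np

# One decoder step's attention, with hypothetical sizes.
T, d = 5, 4                      # encoder length, hidden size (made up)
rng = np.random.default_rng(0)
h = rng.standard_normal((T, d))  # encoder hidden states h[1..T]
s_prev = rng.standard_normal(d)  # decoder's previous hidden state s[t-1]

scores = h @ s_prev                            # e_i = score(s[t-1], h[i])
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax -> weights alpha_i[t]
c = alpha @ h                                  # context vector: sum_i alpha_i[t] * h[i]
print(c.shape)  # (4,)
```

What I cannot find anywhere is the step that comes *after* computing `c`.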

But the latter can't be the case. Seq2Seq originally passed that vector to the decoder as initialisation for its hidden state. Evidently, you can only initialise it once. So then, how is $\vec c[t]$ used by the decoder? None of the sources I have read (see e.g. the original paper linked above, or this article, or this paper, or this otherwise excellent reader) disclose what happens. At most, they hide this mechanism behind "a function $f$" which is never explained. I must be overlooking something super obvious here, apparently.

Mew
  • Hello, can you check this post I just created and lmk which part of it you would like to further break down to get a clear answer? https://datascience.stackexchange.com/a/115015/102852 – hH1sG0n3 Oct 08 '22 at 12:13
  • @hH1sG0n3 Like all of the other sources I linked to, you calculate the attention weights and the context vector, but you never show how to **use** the context vector, only how to **get** it. – Mew Oct 08 '22 at 13:21

2 Answers


> Evidently you can only initialize it ($\vec{c_t}$) once

As I see it, $\vec{c_t}$ depends on $\vec{h}[1] \ldots \vec{h}[n]$ AND on $\vec{s_{t-1}}$ (because the $\alpha_i[t]$ depend on $\vec{s_{t-1}}$, and are therefore different for every calculation of a new $\vec{s_t}$).

And so $\vec{c_t}$ is different for every $\vec{s_t}$.

It can be used by the decoder by e.g., concatenating it with $\vec{s_{t-1}}$, or with $\vec{y_{t-1}}$, or by adding instead of concatenating, or ...
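The concatenation variant can be sketched like this (a toy example: the shapes, the weight matrix `W`, and the simple tanh cell are all made up for illustration; Bahdanau's paper actually uses a gated recurrent unit):

```python
import numpy as np

# Sketch: the context vector enters the decoder step via concatenation.
d_h, d_y = 4, 3                                   # hypothetical sizes
rng = np.random.default_rng(1)
W = rng.standard_normal((d_h, d_y + d_h + d_h))   # maps [y; c; s] -> new state

def decoder_step(s_prev, y_prev, c_t):
    x = np.concatenate([y_prev, c_t, s_prev])     # context concatenated into the input
    return np.tanh(W @ x)                         # s_t = f(s_{t-1}, y_{t-1}, c_t)

s_t = decoder_step(rng.standard_normal(d_h),
                   rng.standard_normal(d_y),
                   rng.standard_normal(d_h))
print(s_t.shape)  # (4,)
```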

Nathan
  • Your clarification of `it` is incorrect. `it` in that sentence stands for $\vec s$, the decoder hidden state. See the preceding sentence: *(...) the decoder as initialisation for its hidden state. Evidently, you can only initialise it once.* I also indicated that $\alpha_i[t]$ is time-dependent, both with the $[t]$ parameter and by stating *(...) the decoder computes weights $\alpha_i[t]$ with each iteration -- by comparing the decoder's previous hidden state $\vec s[t−1]$ (...)*. Do you have a source for your last paragraph? It's the usage of $\vec c[t]$ I'm looking for (hence the title). – Mew Oct 24 '22 at 10:46
  • Right, I misinterpreted what you meant by 'it'. It seems to me you can only initialize $\vec{s_0}$ once, sure, but for the calculation of each subsequent $\vec{s_t}$, a different context vector $\vec{c_t}$ is used, combining the (fixed) encoder-hidden-states $\vec{h}[1] \ldots \vec{h}[n] $ with different $\alpha$ weights. – Nathan Oct 24 '22 at 11:01
  • For the last paragraph, that was just to illustrate how it could be done. However, it seems that the appendix of the Bahdanau paper shows the details of how $\vec{c_t}$ is used: https://arxiv.org/pdf/1409.0473.pdf#page=12, apparently roughly by adding it (after projection) to (projected versions of) $\vec{s_{i-1}}$ and $\vec{y_{i-1}}$. – Nathan Oct 24 '22 at 11:03

If our similarity function is defined as

$$e^{t,t'} = f(y^t, h^{t'})$$

for a $(t, t')$ pair, this gives you an attention weight/score for each encoder hidden state $h^{t'}$. In this way you end up with $T$ weighted hidden states $h^1, \ldots, h^T$, which you combine using a sum. This weighted sum is the "context vector"

$$c_t = \sum_{i=1}^{T} a_{t,i}\, h_i$$

To answer your question, for each position you end up with a context vector $c_t$ which fully replaces the hidden state $h^{t}$ at the exact same position of your computation graph.

$$ h_t \rightarrow c_t \quad \text{as} \quad \mathrm{LSTM}_{\text{simple}} \rightarrow \mathrm{LSTM}_{\text{attentive}} $$
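In loop form, the substitution looks roughly like this (a toy sketch: the dot-product score and the tanh cell are simplified stand-ins, and the shapes are invented). The point is that a fresh $c_t$ is recomputed from all encoder states at every decoder position:

```python
import numpy as np

# Toy decoder loop: each position computes its own context vector c_t
# from the (fixed) encoder states and uses it in the state update.
T, d = 5, 4
rng = np.random.default_rng(2)
h = rng.standard_normal((T, d))      # encoder states, fixed after encoding
U = rng.standard_normal((d, 2 * d))  # toy recurrence weights

s = np.zeros(d)                      # initial decoder state s_0
for t in range(3):                   # three decoder steps
    scores = h @ s                   # e_{t,i} for every encoder position i
    alpha = np.exp(scores) / np.exp(scores).sum()
    c_t = alpha @ h                  # this position's context vector
    s = np.tanh(U @ np.concatenate([s, c_t]))  # simplified s_t = f(s_{t-1}, c_t)
print(s.shape)  # (4,)
```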

Hope it helps!

References:
- Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, 2014

hH1sG0n3
  • This seems unlikely to me. What is this $h_t$ at the end of your post? It isn't an encoder state, so you must mean a decoder state $s_t$. Yet, Bahdanau on page 3 writes that $s_i = f(s_{i−1}, y_{i−1}, c_i)$, which suggests to me that $s_{t-1} \neq c_{t-1}$, otherwise Bahdanau would have written $s_i = f(c_{i−1}, y_{i−1}, c_i)$. – Mew Oct 08 '22 at 19:23
  • Let me take a step back. In a simple LSTM, can you define the difference between $h_t$ and $y_t$, so that I understand your point of view better? – hH1sG0n3 Oct 12 '22 at 12:34
  • It doesn't matter how I define these things. That equation is a literal quote from the Bahdanau paper linked in my question. It also doesn't matter whether it's with or without attention, because the Seq2Seq decoder LSTM feeds its own output $y_t$ back to its input regardless. $h_t$ are the encoder's hidden states. $s_t$ are the decoder's hidden states. – Mew Oct 12 '22 at 17:44