For questions about the concept of attention in artificial intelligence and machine learning. Attention-like mechanisms have been used successfully in natural language processing and computer vision tasks, such as machine translation. For a review of attention-based mechanisms in NLP, see "Attention in Natural Language Processing" by Andrea Galassi et al.
Questions tagged [attention]
147 questions
26 votes, 1 answer
What exactly are the "parameters" in GPT-3's 175 billion parameters and how are they chosen/generated?
When I studied neural networks, the parameters were things like the learning rate and batch size. But even GPT-3's arXiv paper does not say what exactly the parameters are; it only gives a small hint that they might just be sentences.
Even tutorial…

Nav
17 votes, 1 answer
What is the intuition behind the dot product attention?
I am watching the video Attention Is All You Need by Yannic Kilcher.
My question is: what is the intuition behind the dot product attention?
$$A(q, K, V) = \sum_i \frac{e^{q \cdot k_i}}{\sum_j e^{q \cdot k_j}} v_i$$
becomes:
$$A(Q, K, V) = \text{softmax}(QK^T)V$$
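A minimal NumPy sketch of that formula (shapes and variable names are illustrative, not taken from the question):

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """softmax(Q K^T) V, matching the formula above.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    """
    scores = Q @ K.T                                 # q . k_i for every query/key pair
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the values

Q, K, V = np.random.randn(2, 4), np.random.randn(3, 4), np.random.randn(3, 5)
print(dot_product_attention(Q, K, V).shape)          # (2, 5)
```

Each output row is a convex combination of the value vectors, weighted by how well the query matches each key.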

DRV
16 votes, 2 answers
Why does GPT-2 Exclude the Transformer Encoder?
After looking into transformers, BERT, and GPT-2, from what I understand, GPT-2 essentially uses only the decoder part of the original transformer architecture and uses masked self-attention that can only look at prior tokens.
Why does GPT-2 not…

Athena Wisdom
15 votes, 3 answers
What kind of word embedding is used in the original transformer?
I am currently trying to understand transformers.
To start, I read Attention Is All You Need and also this tutorial.
What I keep wondering about is the word embedding used in the model. Is word2vec or GloVe being used? Are the word embeddings trained from…

Bert Gayus
13 votes, 2 answers
Is there any artificial intelligence that possesses "concentration"?
Humans can do multiple tasks at the same time (e.g. reading while listening to music), but we memorize information from less focused sources less efficiently than we do from our main focus or task.
Do such things exist in the case of artificial…

Zoltán Schmidt
11 votes, 1 answer
In Computer Vision, what is the difference between a transformer and attention?
Having studied computer vision for a while, I still cannot understand the difference between a transformer and attention.

novice
11 votes, 3 answers
Why is dot product attention faster than additive attention?
In section 3.2.1 of Attention Is All You Need the claim is made that:
Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a…
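A rough sketch of the two compatibility functions being contrasted (the parameter names W_q, W_k, and v below are illustrative placeholders, not the paper's notation):

```python
import numpy as np

d_k = 64
q, k = np.random.randn(d_k), np.random.randn(d_k)

# Scaled dot-product score: for whole batches this reduces to a single matrix
# multiplication, which maps onto highly optimized GEMM kernels.
dot_score = q @ k / np.sqrt(d_k)

# Additive (Bahdanau-style) score: a small feed-forward network with a tanh,
# evaluated for every query/key pair, hence slower in practice.
W_q, W_k, v = (np.random.randn(d_k, d_k), np.random.randn(d_k, d_k),
               np.random.randn(d_k))
additive_score = v @ np.tanh(W_q @ q + W_k @ k)
```

The two have similar theoretical complexity; the practical speed difference comes from the dot product being implementable with dense matrix multiplications.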

user3180
10 votes, 1 answer
How does the (decoder-only) transformer architecture work?
How does the (decoder-only) transformer architecture that is used in impressive models such as GPT-4 work?

Robin van Hoorn
10 votes, 2 answers
Why are embeddings added, not concatenated?
Let's consider the following example from BERT
I cannot understand why "the input embeddings are the sum of the token embeddings, the segmentation embeddings, and the position embeddings". The thing is, these embeddings carry different types of…
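An illustrative sketch of that sentence (the sizes are BERT-base-like placeholders, and the lookup tables are randomly initialized here rather than trained):

```python
import numpy as np

vocab_size, max_len, n_segments, d_model = 30522, 512, 2, 768

# Three separately learned lookup tables (random here, for illustration only).
token_emb = np.random.randn(vocab_size, d_model) * 0.02
segment_emb = np.random.randn(n_segments, d_model) * 0.02
position_emb = np.random.randn(max_len, d_model) * 0.02

token_ids = np.array([101, 7592, 2088, 102])   # hypothetical ids for [CLS] hello world [SEP]
segment_ids = np.array([0, 0, 0, 0])
positions = np.arange(len(token_ids))

# The input representation is the element-wise SUM, so every token keeps a single
# d_model-dimensional vector instead of a 3 * d_model concatenation.
inputs = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(inputs.shape)   # (4, 768)
```

Summing keeps the model width fixed at d_model; whether the different kinds of information remain separable after summing is exactly what the question is asking about.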

nalzok
10 votes, 1 answer
Why does a transformer not use an activation function following the multi-head attention layer?
I was hoping someone could explain why, in the transformer model from the "Attention Is All You Need" paper, no activation is applied after the multi-head attention layer or to the residual connections. It seems to me that there are…
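For reference, a rough sketch of one post-norm encoder layer as described in that paper (simplified, with a stand-in attention function): the only explicit activation between the residual connections is the ReLU inside the position-wise feed-forward sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(x, attn, W1, b1, W2, b2):
    # Sublayer 1: the multi-head attention output goes straight into Add & Norm,
    # with no activation function in between.
    x = layer_norm(x + attn(x))
    # Sublayer 2: the ReLU lives inside the position-wise feed-forward network.
    ffn = np.maximum(0.0, x @ W1 + b1) @ W2 + b2
    return layer_norm(x + ffn)

# Toy usage with an identity stand-in where a real multi-head attention block would go.
d_model, d_ff, n_tok = 512, 2048, 10
x = np.random.randn(n_tok, d_model)
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)
print(encoder_layer(x, lambda t: t, W1, b1, W2, b2).shape)   # (10, 512)
```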

chasep255
10 votes, 1 answer
A mathematical explanation of Attention Mechanism
I am trying to understand why attention models are different from just using neural networks. Essentially, optimizing the weights, or using gates to protect and control the cell state (in recurrent networks), should eventually lead to the…

Abhijay Ghildyal
10 votes, 1 answer
Why don't people use nonlinear activation functions after projecting the query key value in attention?
It seems like doing this would add much-needed nonlinearity; otherwise, we're just doing linear transformations.
This observation applies to…

user3180
9 votes, 2 answers
What is different in each head of a multi-head attention mechanism?
I have a difficult time understanding the "multi-head" notion in the original transformer paper. What makes the learning in each head unique? Why doesn't the neural network learn the same set of parameters for each attention head? Is it because we…
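One way to see it, as a compact sketch (shapes and initialization are illustrative): each head owns its own projection matrices, so the heads start from different random weights and are free to learn different attention patterns.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

d_model, n_heads = 512, 8
d_head = d_model // n_heads
x = np.random.randn(10, d_model)          # 10 tokens

heads = []
for _ in range(n_heads):
    # Independently initialized W_q, W_k, W_v per head: the symmetry between
    # heads is broken at initialization, so gradient descent has no reason to
    # drive them all to the same parameters.
    W_q, W_k, W_v = (np.random.randn(d_model, d_head) * 0.02 for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)

W_o = np.random.randn(d_model, d_model) * 0.02
output = np.concatenate(heads, axis=-1) @ W_o   # back to (10, 512)
```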

mhsnk
9 votes, 3 answers
What is the purpose of Decoder mask (triangular mask) in Transformer?
I'm trying to implement the transformer model using this tutorial. In the decoder block of the Transformer model, a mask is passed to "pad and mask future tokens in the input received by the decoder". This mask is added to the attention weights.
import…
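The question's own code is cut off above; as a rough NumPy sketch of the idea (not the tutorial's TensorFlow code), the look-ahead mask is an upper-triangular matrix of large negative values added to the attention scores, so that after the softmax each position assigns zero weight to future positions:

```python
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)        # raw attention scores, e.g. Q K^T / sqrt(d_k)

# Triangular look-ahead mask: -inf strictly above the diagonal, 0 elsewhere.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

masked = scores + mask                            # the mask is ADDED to the scores
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax

print(np.round(weights, 2))   # row i has zero weight on every position j > i
```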

Uchiha Madara
9 votes, 1 answer
What is the intuition behind the attention mechanism?
Attention is one of the most influential ideas in deep learning. The main idea behind the attention technique is that it allows the decoder to "look back" at the complete input and extract significant information that is useful in decoding.
I am…

Pluviophile