For questions about the concept of attention in artificial intelligence and machine learning. Attention-like mechanisms have been used successfully in natural language processing and computer vision tasks, such as machine translation. For a review of attention-based mechanisms in NLP, see "Attention in Natural Language Processing" by Andrea Galassi et al.
Questions tagged [attention]
147 questions
26 votes, 1 answer
What exactly are the "parameters" in GPT-3's 175 billion parameters and how are they chosen/generated?
When I studied neural networks, the parameters were things like the learning rate and batch size. But even GPT-3's arXiv paper does not say what exactly the parameters are; it only gives a small hint that they might just be sentences.
Even tutorial…

Nav
17 votes, 1 answer
What is the intuition behind the dot product attention?
I am watching the video Attention Is All You Need by Yannic Kilcher.
My question is: what is the intuition behind the dot product attention?
$$A(q, K, V) = \sum_i \frac{e^{q \cdot k_i}}{\sum_j e^{q \cdot k_j}} v_i$$
becomes:
$$A(Q, K, V) = \text{softmax}(QK^T)V$$
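A minimal NumPy sketch of that formula (shapes and variable names are illustrative, not taken from the question):

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """softmax(Q K^T) V, matching the formula above.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    """
    scores = Q @ K.T                                 # q . k_i for every query/key pair
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the values

Q, K, V = np.random.randn(2, 4), np.random.randn(3, 4), np.random.randn(3, 5)
print(dot_product_attention(Q, K, V).shape)          # (2, 5)
```

Each output row is a convex combination of the value vectors, weighted by how well the query matches each key.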

DRV
16 votes, 2 answers
Why does GPT-2 Exclude the Transformer Encoder?
After looking into transformers, BERT, and GPT-2, from what I understand, GPT-2 essentially uses only the decoder part of the original transformer architecture and uses masked self-attention that can only look at prior tokens.
Why does GPT-2 not…

Athena Wisdom
15 votes, 3 answers
What kind of word embedding is used in the original transformer?
I am currently trying to understand transformers.
To start, I read Attention Is All You Need and also this tutorial.
What I keep wondering about is the word embedding used in the model. Is word2vec or GloVe being used? Are the word embeddings trained from…

Bert Gayus
13 votes, 2 answers
Is there any artificial intelligence that possesses "concentration"?
Humans can do multiple tasks at the same time (e.g. reading while listening to music), but we memorize information from less focused sources less efficiently than we do from our main focus or task.
Do such things exist in the case of artificial…

Zoltán Schmidt
11 votes, 1 answer
In Computer Vision, what is the difference between a transformer and attention?
Having studied computer vision for a while, I still cannot understand the difference between a transformer and attention.

novice
11 votes, 3 answers
Why is dot product attention faster than additive attention?
In section 3.2.1 of Attention Is All You Need the claim is made that:
Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a…
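A rough sketch of the two compatibility functions being contrasted (the parameter names W_q, W_k, and v below are illustrative placeholders, not the paper's notation):

```python
import numpy as np

d_k = 64
q, k = np.random.randn(d_k), np.random.randn(d_k)

# Scaled dot-product score: for whole batches this reduces to a single matrix
# multiplication, which maps onto highly optimized GEMM kernels.
dot_score = q @ k / np.sqrt(d_k)

# Additive (Bahdanau-style) score: a small feed-forward network with a tanh,
# evaluated for every query/key pair, hence slower in practice.
W_q, W_k, v = (np.random.randn(d_k, d_k), np.random.randn(d_k, d_k),
               np.random.randn(d_k))
additive_score = v @ np.tanh(W_q @ q + W_k @ k)
```

The two have similar theoretical complexity; the practical speed difference comes from the dot product being implementable with dense matrix multiplications.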

user3180
10 votes, 1 answer
How does the (decoder-only) transformer architecture work?
How does the (decoder-only) transformer architecture that is used in impressive models such as GPT-4 work?

Robin van Hoorn
10 votes, 2 answers
Why are embeddings added, not concatenated?
Let's consider the following example from BERT
I cannot understand why "the input embeddings are the sum of the token embeddings, the segmentation embeddings, and the position embeddings". The thing is, these embeddings carry different types of…
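An illustrative sketch of that sentence (the sizes are BERT-base-like placeholders, and the lookup tables are randomly initialized here rather than trained):

```python
import numpy as np

vocab_size, max_len, n_segments, d_model = 30522, 512, 2, 768

# Three separately learned lookup tables (random here, for illustration only).
token_emb = np.random.randn(vocab_size, d_model) * 0.02
segment_emb = np.random.randn(n_segments, d_model) * 0.02
position_emb = np.random.randn(max_len, d_model) * 0.02

token_ids = np.array([101, 7592, 2088, 102])   # hypothetical ids for [CLS] hello world [SEP]
segment_ids = np.array([0, 0, 0, 0])
positions = np.arange(len(token_ids))

# The input representation is the element-wise SUM, so every token keeps a single
# d_model-dimensional vector instead of a 3 * d_model concatenation.
inputs = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(inputs.shape)   # (4, 768)
```

Summing keeps the model width fixed at d_model; whether the different kinds of information remain separable after summing is exactly what the question is asking about.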

nalzok
10 votes, 1 answer
Why does a transformer not use an activation function following the multi-head attention layer?
I was hoping someone could explain why, in the transformer model from the "Attention Is All You Need" paper, no activation is applied after the multi-head attention layer or to the residual connections. It seems to me that there are…
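For reference, a rough sketch of one post-norm encoder layer as described in that paper (simplified, with a stand-in attention function): the only explicit activation between the residual connections is the ReLU inside the position-wise feed-forward sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(x, attn, W1, b1, W2, b2):
    # Sublayer 1: the multi-head attention output goes straight into Add & Norm,
    # with no activation function in between.
    x = layer_norm(x + attn(x))
    # Sublayer 2: the ReLU lives inside the position-wise feed-forward network.
    ffn = np.maximum(0.0, x @ W1 + b1) @ W2 + b2
    return layer_norm(x + ffn)

# Toy usage with an identity stand-in where a real multi-head attention block would go.
d_model, d_ff, n_tok = 512, 2048, 10
x = np.random.randn(n_tok, d_model)
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)
print(encoder_layer(x, lambda t: t, W1, b1, W2, b2).shape)   # (10, 512)
```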

chasep255
10 votes, 1 answer
A mathematical explanation of Attention Mechanism
I am trying to understand why attention models are different from just using neural networks. Essentially, optimizing the weights, or using gates to protect and control the cell state (in recurrent networks), should eventually lead to the…

Abhijay Ghildyal
10 votes, 1 answer
Why don't people use nonlinear activation functions after projecting the query key value in attention?
It seems like doing this would add much-needed nonlinearity; otherwise, we're just doing linear transformations.
This observation applies to…

user3180
9 votes, 2 answers
What is different in each head of a multi-head attention mechanism?
I have a difficult time understanding the "multi-head" notion in the original transformer paper. What makes the learning in each head unique? Why doesn't the neural network learn the same set of parameters for each attention head? Is it because we…
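One way to see it, as a compact sketch (shapes and initialization are illustrative): each head owns its own projection matrices, so the heads start from different random weights and are free to learn different attention patterns.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

d_model, n_heads = 512, 8
d_head = d_model // n_heads
x = np.random.randn(10, d_model)          # 10 tokens

heads = []
for _ in range(n_heads):
    # Independently initialized W_q, W_k, W_v per head: the symmetry between
    # heads is broken at initialization, so gradient descent has no reason to
    # drive them all to the same parameters.
    W_q, W_k, W_v = (np.random.randn(d_model, d_head) * 0.02 for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)

W_o = np.random.randn(d_model, d_model) * 0.02
output = np.concatenate(heads, axis=-1) @ W_o   # back to (10, 512)
```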

mhsnk
9 votes, 3 answers
What is the purpose of Decoder mask (triangular mask) in Transformer?
I'm trying to implement the transformer model using this tutorial. In the decoder block of the Transformer model, a mask is passed to "pad and mask future tokens in the input received by the decoder". This mask is added to the attention weights.
import…
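The question's own code is cut off above; as a rough NumPy sketch of the idea (not the tutorial's TensorFlow code), the look-ahead mask is an upper-triangular matrix of large negative values added to the attention scores, so that after the softmax each position assigns zero weight to future positions:

```python
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)        # raw attention scores, e.g. Q K^T / sqrt(d_k)

# Triangular look-ahead mask: -inf strictly above the diagonal, 0 elsewhere.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

masked = scores + mask                            # the mask is ADDED to the scores
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax

print(np.round(weights, 2))   # row i has zero weight on every position j > i
```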

Uchiha Madara
9 votes, 1 answer
What is the intuition behind the attention mechanism?
Attention is one of the most influential ideas in deep learning. The main idea behind the attention technique is that it allows the decoder to "look back" at the complete input and extract significant information that is useful in decoding.
I am…

Pluviophile