15

I am currently trying to understand transformers.

To start, I read Attention Is All You Need and also this tutorial.

What I am wondering about is the word embedding used in the model. Is word2vec or GloVe being used? Or are the word embeddings trained from scratch?

In the tutorial linked above, the transformer is implemented from scratch and nn.Embedding from PyTorch is used for the embeddings. I looked up this function, and although I don't fully understand it, I tend to think that the embeddings are trained from scratch. Is that right?
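As far as I can tell, nn.Embedding is essentially a trainable lookup table. Here is a minimal sketch of how I understand it (toy sizes, not code from the tutorial):

```python
import torch
import torch.nn as nn

# A toy vocabulary of 10 tokens, each mapped to a 4-dimensional vector.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# The weight matrix is randomly initialized and registered as a trainable parameter.
print(embedding.weight.requires_grad)  # True

# Looking up token ids returns the corresponding rows of the weight matrix.
token_ids = torch.tensor([1, 5, 7])
print(embedding(token_ids).shape)  # torch.Size([3, 4])
```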

Bert Gayus
  • If you were to implement a Transformer or Word2Vec yourself with PyTorch, in both cases you would probably be using nn.Embedding. Agreed, the original paper is not clear on this. (The other parts of your question have already been answered well by you and in comments.) – 0dB Jun 10 '23 at 07:15

3 Answers

14

I found a good answer in the blog post The Transformer: Attention Is All You Need:

> we learn a “word embedding” which is a smaller real-valued vector representation of the word that carries some information about the word. We can do this using nn.Embedding in Pytorch, or, more generally speaking, by multiplying our one-hot vector with a learned weight matrix W.
>
> There are two options for dealing with the Pytorch nn.Embedding weight matrix. One option is to initialize it with pre-trained embeddings and keep it fixed, in which case it’s really just a lookup table. Another option is to initialize it randomly, or with pre-trained embeddings, but keep it trainable. In that case the word representations will get refined and modified throughout training because the weight matrix will get refined and modified throughout training.
>
> The Transformer uses a random initialization of the weight matrix and refines these weights during training – i.e. it learns its own word embeddings.
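To make the two options concrete, here is a minimal PyTorch sketch (the "pre-trained" matrix below is just a random placeholder standing in for, say, GloVe vectors):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512

# Option 1: initialize from a pre-trained matrix and keep it fixed (a pure lookup table).
pretrained = torch.randn(vocab_size, d_model)  # placeholder for real pre-trained vectors
frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Option 2: random initialization, refined together with the rest of the model
# (this is what the Transformer does: it learns its own word embeddings).
trainable_emb = nn.Embedding(vocab_size, d_model)

print(frozen_emb.weight.requires_grad)     # False
print(trainable_emb.weight.requires_grad)  # True
```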

nbro
Bert Gayus
  • Please, if you're quoting something from another source, prepend `>` to the excerpt you're quoting. See [my edit](https://ai.stackexchange.com/posts/26246/edit) to understand how to do it. – nbro Feb 06 '21 at 15:15
  • This response is confusing because it basically says it could be either trained or not trained. But I assume there is a single overall approach taken by the initial transformers paper, gpt-*, BERT, BLOOM, etc. The original Vaswani paper and GPT papers don't mention anything about an initial word embedding, nor do they mention anything about a trainable embedding matrix -- only trainable projections. So which one is it in those specific cases? Is the initial embedding trained or not? – Paul Mar 01 '23 at 15:46
  • When you set up a transformer you can choose if you want to train from scratch or with pre-trained embeddings, and for the latter, whether it should stay fixed or continue to be trained, just like the answer says. Embeddings and projections are the same concept, the first applied to tokens (non-numbers), the second to numbers, but sometimes you see the terms used synonymously. – 0dB Jun 10 '23 at 06:34
5

No, neither Word2Vec nor GloVe is used, as Transformers are a newer class of algorithms. Word2Vec and GloVe are based on static word embeddings, while Transformers are based on dynamic (contextual) word embeddings.

The embeddings are trained from scratch.
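As a rough illustration of the static vs. dynamic distinction (a toy sketch with untrained weights, not code from any paper): the nn.Embedding lookup for a token is always the same, but the self-attention layers on top of it produce context-dependent representations.

```python
import torch
import torch.nn as nn

d_model = 16
emb = nn.Embedding(100, d_model)  # trained from scratch together with the rest of the model
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
layer.eval()  # disable dropout so the comparison below is deterministic

# The same token id (7) appears in two different contexts.
sent_a = torch.tensor([[7, 3, 5]])
sent_b = torch.tensor([[7, 9, 2]])

# The embedding lookup for token 7 is identical in both sentences ("static")...
print(torch.allclose(emb(sent_a)[0, 0], emb(sent_b)[0, 0]))  # True

# ...but after self-attention its representation depends on the surrounding tokens ("dynamic").
out_a, out_b = layer(emb(sent_a)), layer(emb(sent_b))
print(torch.allclose(out_a[0, 0], out_b[0, 0]))  # False
```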

Brian O'Donnell
  • Maybe you should provide more details about the embeddings used by the transformers (apart from saying that they are learned from scratch). – nbro Feb 05 '21 at 21:11
  • Embeddings in Word2Vec are also trained from scratch – Alex Feb 06 '21 at 00:09
  • And how are they trained from scratch then? In the tutorials I have read so far, that part is never mentioned. Every time only the attention part is covered and the embedding part is skipped. – Bert Gayus Feb 06 '21 at 12:35
  • This answer is deceptive. After an input sequence has been tokenized, the transformer's first step will be to use a STATIC matrix to find that token's embedding. Bert Gayus' answer gives some insight into how that static matrix is trained. – David Skarbrevik Jan 17 '23 at 00:09
  • If you were to implement a Transformer or Word2Vec yourself with PyTorch, in both cases you would probably be using nn.Embedding. No, the embeddings are not necessarily trained from scratch, see my comment on the accepted answer. – 0dB Jun 10 '23 at 07:06
  • Re. static (Word2Vec) and dynamic/contextual (Transformer) embeddings, for the Transformer you would still use PyTorch nn.Embedding ("static") but the attention mechanism disambiguates homonyms (words with different meaning but spelled the same) using the context of the words. – 0dB Jun 11 '23 at 13:52
0

As "initial" word embeddings (those without any positional or context information for each word or sub word) are used from the very beginning It seems to me that someone has to provide a trained embedding for each word at the very beginning.

Raul Alvarez