I am currently trying to understand transformers.
To start, I read Attention Is All You Need and also this tutorial.
What I'm wondering about is the word embeddings used in the model. Are pretrained embeddings like word2vec or GloVe used, or are the word embeddings trained from scratch?
In the tutorial linked above, the transformer is implemented from scratch and nn.Embedding from PyTorch is used for the embeddings. I looked up this class and didn't fully understand it, but I tend to think that the embeddings are trained from scratch, right?
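For reference, here is roughly how I understand nn.Embedding is used (a minimal sketch; the vocabulary size and dimension are placeholder values I picked, not the tutorial's exact numbers):

```python
import torch
import torch.nn as nn

# Placeholder sizes for illustration only.
vocab_size = 10000   # number of tokens in the vocabulary
d_model = 512        # embedding dimension

# nn.Embedding is a lookup table: a (vocab_size x d_model) weight matrix,
# initialized randomly, whose rows are returned for each token id.
embedding = nn.Embedding(vocab_size, d_model)

# A fake batch of token ids (batch_size=2, seq_len=4).
token_ids = torch.tensor([[5, 42, 7, 999],
                          [3, 17, 256, 0]])

vectors = embedding(token_ids)          # shape: (2, 4, 512)
print(vectors.shape)

# The weight matrix requires gradients, so (as far as I can tell) it is
# updated by the optimizer together with the rest of the model,
# i.e. learned from scratch rather than loaded from word2vec/GloVe.
print(embedding.weight.requires_grad)   # True
```

Is this understanding correct, or do transformers typically rely on pretrained embeddings?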