
The BERT paper only mentions that the position embeddings are learned, which is different from what was done in ELMo.

ELMo paper - https://arxiv.org/pdf/1802.05365.pdf

BERT paper - https://arxiv.org/pdf/1810.04805.pdf


2 Answers


Sentences (for tasks such as NLI that take two sentences as input) are differentiated in two ways in BERT:

  • First, a [SEP] token is placed between them
  • Second, a learned embedding $E_A$ is added to every token of the first sentence, and another learned embedding $E_B$ to every token of the second one (see the sketch below)

That is, there are just two possible "segment embeddings": $E_A$ and $E_B$.
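
For concreteness, here is a minimal sketch (assuming a PyTorch-style `nn.Embedding`; the tensor shapes and variable names are illustrative, not BERT's actual implementation) of how the two segment vectors are looked up and added to the token embeddings:

```python
import torch
import torch.nn as nn

hidden_size = 768  # hidden size of BERT-base

# Two learnable rows: row 0 plays the role of E_A, row 1 of E_B
segment_embedding = nn.Embedding(num_embeddings=2, embedding_dim=hidden_size)

# segment_ids marks which sentence each token belongs to, e.g. for
# "[CLS] tokens of sentence A [SEP] tokens of sentence B [SEP]"
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])   # shape: (batch, seq_len)
token_embeds = torch.randn(1, 7, hidden_size)          # stand-in for the token embeddings

# The segment vector is added element-wise to every token embedding
input_embeds = token_embeds + segment_embedding(segment_ids)
print(input_embeds.shape)  # torch.Size([1, 7, 768])
```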

Positional embeddings are learned vectors, one for every possible position from 0 to 511 (BERT's maximum sequence length is 512). Unlike recurrent neural networks, Transformers have no built-in notion of input order, so some information about token positions has to be injected; without it, the self-attention layers cannot distinguish different orderings of the same tokens.
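
In the same spirit, here is a hedged sketch of the learned positional embeddings (again PyTorch-style, purely for illustration; the table size of 512 matches BERT's maximum sequence length):

```python
import torch
import torch.nn as nn

hidden_size = 768
max_position = 512  # BERT supports positions 0 .. 511

# One learnable vector per position, trained like any other weight
position_embedding = nn.Embedding(num_embeddings=max_position, embedding_dim=hidden_size)

seq_len = 7
position_ids = torch.arange(seq_len).unsqueeze(0)      # [[0, 1, ..., 6]], shape (1, seq_len)
input_embeds = torch.randn(1, seq_len, hidden_size)    # token + segment embeddings (stand-in)

# Adding the position vectors is what lets the self-attention layers
# distinguish different orderings of the same tokens
input_embeds = input_embeds + position_embedding(position_ids)
```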


These segment and position embeddings are nothing special: they behave just like token embeddings.

You randomly initialize them and then train them with gradient descent, just as you do with the token embeddings.
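
A hedged end-to-end sketch of that idea (illustrative class and variable names, not the reference BERT code): all three lookup tables are ordinary, randomly initialized `nn.Embedding` weights, so the optimizer updates the segment and position tables exactly as it updates the token table.

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Token + segment + position embeddings, all learned jointly."""
    def __init__(self, vocab_size=30522, hidden_size=768, max_position=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)       # one row per wordpiece
        self.segment = nn.Embedding(2, hidden_size)               # E_A and E_B
        self.position = nn.Embedding(max_position, hidden_size)   # one row per position

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))

# All three tables start random and receive gradients during training
emb = BertInputEmbeddings()
out = emb(torch.randint(0, 30522, (1, 7)), torch.tensor([[0, 0, 0, 0, 1, 1, 1]]))
out.sum().backward()
print(emb.position.weight.grad is not None)  # True: the position table is trained too
```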
