The BERT paper only states that the position embeddings are learned, which is different from what was done in ELMo.
ELMo paper - https://arxiv.org/pdf/1802.05365.pdf
BERT paper - https://arxiv.org/pdf/1810.04805.pdf
Sentences (for tasks such as NLI that take two sentences as input) are differentiated in two ways in BERT:

1. A [SEP] token is put between them.
2. A learned segment embedding is added to every token, indicating whether it belongs to sentence A or sentence B. That is, there are just two possible "segment embeddings": $E_A$ and $E_B$ (see the sketch after this list).
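For concreteness, here is a minimal PyTorch sketch of how such segment embeddings can be added to the token embeddings. This is not BERT's actual implementation; the hidden size, vocabulary size, and token ids below are made-up values:

```python
import torch
import torch.nn as nn

hidden_size = 768   # assumed, matches BERT-base
vocab_size = 30522  # assumed, matches BERT's WordPiece vocab

token_emb = nn.Embedding(vocab_size, hidden_size)
segment_emb = nn.Embedding(2, hidden_size)  # only two rows: E_A and E_B

# "[CLS] sentence A [SEP] sentence B [SEP]" -> token ids (made up here)
input_ids = torch.tensor([[101, 2023, 2003, 102, 2008, 2001, 102]])
# 0 for every token of sentence A (incl. [CLS] and the first [SEP]), 1 for sentence B
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])

embeddings = token_emb(input_ids) + segment_emb(segment_ids)
print(embeddings.shape)  # torch.Size([1, 7, 768])
```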
Positional embeddings are learned vectors, one for every possible position from 0 to 511 (the maximum sequence length is 512). Transformers do not have the sequential nature of recurrent neural networks, so some information about the order of the input is needed; without it, the output would be permutation-invariant.
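A minimal sketch of such learned position embeddings, along the same lines as above (again PyTorch, with the maximum length of 512 and a hidden size of 768 assumed):

```python
import torch
import torch.nn as nn

max_len, hidden_size = 512, 768
position_emb = nn.Embedding(max_len, hidden_size)  # one trainable vector per position

seq_len = 7
position_ids = torch.arange(seq_len).unsqueeze(0)  # [[0, 1, 2, 3, 4, 5, 6]]
pos_vectors = position_emb(position_ids)           # shape: (1, 7, 768)

# These vectors are added to the token (+ segment) embeddings, so two
# permutations of the same tokens no longer produce identical inputs.
```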
These embeddings behave just like token embeddings: you randomly initialize them and then train them with gradient descent along with the rest of the model, as sketched below.
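To illustrate that point, here is a toy sketch (PyTorch, made-up sizes and a dummy loss) showing that the position embedding table receives gradients and is updated by the optimizer in exactly the same way as the token embedding table:

```python
import torch
import torch.nn as nn

token_emb = nn.Embedding(100, 16)    # toy vocab of 100, hidden size 16
position_emb = nn.Embedding(10, 16)  # toy maximum length of 10

optimizer = torch.optim.SGD(
    list(token_emb.parameters()) + list(position_emb.parameters()), lr=0.1
)

input_ids = torch.tensor([[5, 7, 2]])
position_ids = torch.arange(3).unsqueeze(0)

before = position_emb.weight.detach().clone()
loss = (token_emb(input_ids) + position_emb(position_ids)).pow(2).mean()  # dummy loss
loss.backward()
optimizer.step()

print(torch.allclose(before, position_emb.weight))  # False: the table was updated
```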