The BERT paper only states that the position embeddings are learned, which is different from what was done in ELMo.
ELMo paper - https://arxiv.org/pdf/1802.05365.pdf
BERT paper - https://arxiv.org/pdf/1810.04805.pdf
Sentences (for tasks such as NLI that take two sentences as input) are differentiated in two ways in BERT:

1. A [SEP] token is put between them.
2. A learned segment embedding is added to every token, indicating whether it belongs to sentence A or sentence B. That is, there are just two possible "segment embeddings": $E_A$ and $E_B$ (see the sketch after this list).
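For concreteness, here is a minimal PyTorch sketch of how such segment embeddings can be added to the token embeddings. This is not BERT's actual implementation; the hidden size, vocabulary size, and token ids below are made-up values:

```python
import torch
import torch.nn as nn

hidden_size = 768   # assumed, matches BERT-base
vocab_size = 30522  # assumed, matches BERT's WordPiece vocab

token_emb = nn.Embedding(vocab_size, hidden_size)
segment_emb = nn.Embedding(2, hidden_size)  # only two rows: E_A and E_B

# "[CLS] sentence A [SEP] sentence B [SEP]" -> token ids (made up here)
input_ids = torch.tensor([[101, 2023, 2003, 102, 2008, 2001, 102]])
# 0 for every token of sentence A (incl. [CLS] and the first [SEP]), 1 for sentence B
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])

embeddings = token_emb(input_ids) + segment_emb(segment_ids)
print(embeddings.shape)  # torch.Size([1, 7, 768])
```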
Positional embeddings are learned vectors, one for every possible position from 0 to 511 (the maximum sequence length is 512). Transformers do not have the sequential nature of recurrent neural networks, so some information about the order of the input is needed; without it, the output would be permutation-invariant.
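A minimal sketch of such learned position embeddings, along the same lines as above (again PyTorch, with the maximum length of 512 and a hidden size of 768 assumed):

```python
import torch
import torch.nn as nn

max_len, hidden_size = 512, 768
position_emb = nn.Embedding(max_len, hidden_size)  # one trainable vector per position

seq_len = 7
position_ids = torch.arange(seq_len).unsqueeze(0)  # [[0, 1, 2, 3, 4, 5, 6]]
pos_vectors = position_emb(position_ids)           # shape: (1, 7, 768)

# These vectors are added to the token (+ segment) embeddings, so two
# permutations of the same tokens no longer produce identical inputs.
```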
These embeddings behave just like token embeddings: you randomly initialize them and then train them with gradient descent along with the rest of the model, as sketched below.
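To illustrate that point, here is a toy sketch (PyTorch, made-up sizes and a dummy loss) showing that the position embedding table receives gradients and is updated by the optimizer in exactly the same way as the token embedding table:

```python
import torch
import torch.nn as nn

token_emb = nn.Embedding(100, 16)    # toy vocab of 100, hidden size 16
position_emb = nn.Embedding(10, 16)  # toy maximum length of 10

optimizer = torch.optim.SGD(
    list(token_emb.parameters()) + list(position_emb.parameters()), lr=0.1
)

input_ids = torch.tensor([[5, 7, 2]])
position_ids = torch.arange(3).unsqueeze(0)

before = position_emb.weight.detach().clone()
loss = (token_emb(input_ids) + position_emb(position_ids)).pow(2).mean()  # dummy loss
loss.backward()
optimizer.step()

print(torch.allclose(before, position_emb.weight))  # False: the table was updated
```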