Questions tagged [bert]

For questions related to BERT (which stands for Bidirectional Encoder Representations from Transformers), a language representation model introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (2019) by Google.

87 questions
31 votes · 3 answers

Can BERT be used for sentence generating tasks?

I am a new learner in NLP. I am interested in the sentence generating task. As far as I am concerned, one state-of-the-art method is CharRNN, which uses an RNN to generate a sequence of words. However, BERT came out several weeks ago and is…
26 votes · 1 answer

How is BERT different from the original transformer architecture?

As far as I can tell, BERT is a type of Transformer architecture. What I do not understand is: how is BERT different from the original transformer architecture? What tasks are better suited for BERT, and what tasks are better suited for the…
17 votes · 1 answer

What is the intuition behind the dot product attention?

I am watching the video Attention Is All You Need by Yannic Kilcher. My question is: what is the intuition behind the dot product attention? $$A(q, K, V) = \sum_i \frac{e^{q \cdot k_i}}{\sum_j e^{q \cdot k_j}} v_i$$ becomes: $$A(Q, K, V) = \text{softmax}(QK^T)V$$
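
The two expressions compute the same thing; the second just stacks all the queries into a matrix and applies the softmax row by row. A minimal NumPy sketch for illustration, omitting the $1/\sqrt{d_k}$ scaling just as the quoted formula does:

```python
import numpy as np

def attention_per_query(q, K, V):
    # A(q, K, V) = sum_i softmax_i(q . k_i) * v_i  -- one query at a time
    scores = K @ q                      # dot product of q with every key k_i
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ V                  # convex combination of the value vectors

def attention_matrix(Q, K, V):
    # A(Q, K, V) = softmax(Q K^T) V  -- all queries at once
    scores = Q @ K.T
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
assert np.allclose(attention_per_query(Q[0], K, V), attention_matrix(Q, K, V)[0])
```
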
16 votes · 2 answers

Why does GPT-2 Exclude the Transformer Encoder?

After looking into transformers, BERT, and GPT-2, from what I understand, GPT-2 essentially uses only the decoder part of the original transformer architecture, with masked self-attention that can only look at prior tokens. Why does GPT-2 not…
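
For context, "masked self-attention that can only look at prior tokens" amounts to forcing the attention weights for future positions to zero. A minimal NumPy sketch (illustrative, not GPT-2's actual code):

```python
import numpy as np

def causal_self_attention_weights(scores):
    """Mask out future positions so token t can only attend to tokens <= t."""
    T = scores.shape[-1]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)           # future tokens get -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(5, 5))  # query-by-key scores
W = causal_self_attention_weights(scores)
print(np.round(W, 2))   # lower-triangular (including diagonal) attention weights
```
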
11 votes · 2 answers

What are the segment embeddings and position embeddings in BERT?

The paper only mentions that the position embeddings are learned, which is different from what was done in ELMo. ELMo paper - https://arxiv.org/pdf/1802.05365.pdf BERT paper - https://arxiv.org/pdf/1810.04805.pdf
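
For reference, BERT builds its input representation by summing three learned embedding tables: token, segment (sentence A vs. sentence B), and position. A minimal PyTorch sketch with bert-base-style sizes (illustrative; the real model also applies LayerNorm and dropout to the sum):

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768    # bert-base-style sizes (illustrative)

token_emb    = nn.Embedding(vocab_size, hidden)  # WordPiece token ids
segment_emb  = nn.Embedding(2, hidden)           # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)     # learned, not sinusoidal

token_ids   = torch.tensor([[101, 7592, 102, 2088, 102]])   # illustrative ids
segment_ids = torch.tensor([[0,   0,    0,   1,    1]])     # A, A, A, B, B
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)  # 0..4

# The input to the first encoder layer is the elementwise sum of the three
x = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(x.shape)   # torch.Size([1, 5, 768])
```
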
8 votes · 1 answer

Are there transformer-based architectures that can produce fixed-length vector encodings given arbitrary-length text documents?

BERT encodes a piece of text such that each token (usually a word or word piece) in the input text maps to a vector in the encoding of the text. However, this makes the length of the encoding vary as a function of the input length of the text, which makes it more…
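
One common workaround (used, for example, in Sentence-BERT-style approaches) is to pool the variable-length token vectors into a single fixed-size vector. A sketch assuming the Hugging Face transformers library, with mean pooling as one possible choice:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumes the Hugging Face `transformers` library; mean pooling is one common
# (not the only) way to collapse variable-length output into a fixed-size vector.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def encode(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        token_vectors = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)            # ignore padding
    return (token_vectors * mask).sum(1) / mask.sum(1)       # (1, 768), fixed size

print(encode("a short sentence").shape, encode("a much, much longer sentence").shape)
```
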
7 votes · 1 answer

How to use BERT as a multi-purpose conversational AI?

I'm looking to make an NLP model that can achieve a dual purpose. One purpose is that it can hold interesting conversations (conversational AI), and the other is that it can do intent classification and even accomplish the classified task. To…
5 votes · 2 answers

Where can I find pre-trained language models in English and German?

Where can I find (more) pre-trained language models? I am especially interested in neural network-based models for English and German. I am aware only of Language Model on One Billion Word Benchmark and TF-LM: TensorFlow-based Language Modeling…
5 votes · 2 answers

What is the intermediate (dense) layer between the attention-output and encoder-output dense layers within the transformer block in the PyTorch implementation?

In PyTorch, transformer (BERT) models have an intermediate dense layer between the attention and output layers, whereas the BERT and Transformer papers just mention the attention connected directly to the output fully connected layer for the encoder just…
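
The "intermediate" layer in question is the position-wise feed-forward sub-layer from the Transformer paper; in the bert-base configuration it expands the 768-dimensional hidden state to 3072 and projects it back. A minimal PyTorch sketch (illustrative; the real block also adds a residual connection and LayerNorm):

```python
import torch
import torch.nn as nn

hidden, intermediate = 768, 3072     # bert-base: intermediate size is 4x the hidden size

class PositionWiseFFN(nn.Module):
    """The 'intermediate' + 'output' dense pair that follows attention in each block."""
    def __init__(self):
        super().__init__()
        self.intermediate = nn.Linear(hidden, intermediate)  # expand
        self.activation = nn.GELU()                          # BERT uses GELU
        self.output = nn.Linear(intermediate, hidden)        # project back down

    def forward(self, x):                 # x: (batch, seq_len, hidden)
        return self.output(self.activation(self.intermediate(x)))

x = torch.randn(1, 5, hidden)
print(PositionWiseFFN()(x).shape)         # torch.Size([1, 5, 768])
```
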
5 votes · 2 answers

Transformers: how does the decoder final layer output the desired token?

In the paper Attention Is All You Need, this section confuses me: In our model, we share the same weight matrix between the two embedding layers [in the encoding section] and the pre-softmax linear transformation [output of the decoding…
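
An illustrative sketch of the weight sharing described in that passage: the pre-softmax linear layer reuses the (transposed) embedding matrix, so the decoder's final states are scored against every vocabulary embedding and a softmax turns the scores into next-token probabilities. Sizes and names below are made up for the example:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64                       # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)        # used on the input side

def output_logits(decoder_states):
    # Pre-softmax linear transform shares weights with the embedding matrix:
    # each score is a dot product with one token's embedding.
    return decoder_states @ embedding.weight.T        # (batch, seq, vocab_size)

decoder_states = torch.randn(1, 7, d_model)           # final decoder layer output
probs = torch.softmax(output_logits(decoder_states), dim=-1)
next_token = probs[0, -1].argmax()                    # highest-probability next token
print(probs.shape, next_token)
```
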
4 votes · 1 answer

Is there a pretrained (NLP) transformer that uses subword n-gram embeddings for tokenization like fasttext?

I know of several tokenization methods that are used for transformer models, like WordPiece for BERT and BPE for RoBERTa, among others. What I was wondering is whether there is also a transformer which uses a method for tokenization similar to the embeddings…
4 votes · 1 answer

Why aren't the BERT layers frozen during fine-tuning tasks?

During transfer learning in computer vision, I've seen that the layers of the base model are frozen if the images aren't too different from those the base model was trained on. However, on the NLP side, I see that the layers of the BERT…
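
Mechanically, freezing BERT's encoder is as easy as in the vision case; the question is why it is usually not done. A sketch assuming the Hugging Face transformers library:

```python
from transformers import BertForSequenceClassification

# Assumes the Hugging Face `transformers` library. Freezing the encoder is
# possible; it is just not the usual recipe for BERT fine-tuning.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

for param in model.bert.parameters():    # freeze every embedding/encoder weight
    param.requires_grad = False

# Only the small classification head on top remains trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```
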
4 votes · 1 answer

How to fine tune BERT for question answering?

I wish to train two domain-specific models. Domain 1: the Constitution and related legal documents. Domain 2: technical and related documents. For Domain 1, I have access to a text corpus with texts from the constitution and no question-context-answer…
4 votes · 1 answer

Will the BERT embedding always be the same for a given document when used as a feature extractor?

When we use BERT embeddings for a classification task, would we get different embeddings every time we pass the same text through the BERT architecture? If so, is this the right way to use the embeddings as features? Ideally, while using any feature…
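
A sketch assuming the Hugging Face transformers library: with the model in eval mode (dropout disabled) and fixed pretrained weights, repeated passes of the same text produce the same features, which is what one would expect from a feature extractor:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumes the Hugging Face `transformers` library.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()   # .eval() disables dropout

inputs = tokenizer("the same document, passed twice", return_tensors="pt")
with torch.no_grad():
    first = model(**inputs).last_hidden_state
    second = model(**inputs).last_hidden_state

# Identical features on every pass (up to the backend's numerical determinism)
print(torch.allclose(first, second))
```
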
4 votes · 1 answer

Why does the BERT encoder have an intermediate layer between the attention and neural network layers with a bigger output?

I am reading the BERT paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. As I look at the attention mechanism, I don't understand why in the BERT encoder we have an intermediate layer between the attention and…