Questions tagged [bert]

For questions related to BERT (which stands for Bidirectional Encoder Representations from Transformers), a language representation model introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (2019) by Google.

87 questions
31 votes · 3 answers

Can BERT be used for sentence generating tasks?

I am a new learner in NLP. I am interested in the sentence generating task. As far as I am concerned, one state-of-the-art method is CharRNN, which uses an RNN to generate a sequence of words. However, BERT came out several weeks ago and is…
26 votes · 1 answer

How is BERT different from the original transformer architecture?

As far as I can tell, BERT is a type of Transformer architecture. What I do not understand is: how is BERT different from the original transformer architecture? What tasks are better suited for BERT, and what tasks are better suited for the…
17 votes · 1 answer

What is the intuition behind the dot product attention?

I am watching the video Attention Is All You Need by Yannic Kilcher. My question is: what is the intuition behind the dot product attention? $$A(q, K, V) = \sum_i \frac{e^{q \cdot k_i}}{\sum_j e^{q \cdot k_j}} v_i$$ becomes: $$A(Q, K, V) = \text{softmax}(QK^T)V$$
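
The two expressions compute the same thing; the second just stacks all the queries into a matrix and applies the softmax row by row. A minimal NumPy sketch for illustration, omitting the $1/\sqrt{d_k}$ scaling just as the quoted formula does:

```python
import numpy as np

def attention_per_query(q, K, V):
    # A(q, K, V) = sum_i softmax_i(q . k_i) * v_i  -- one query at a time
    scores = K @ q                      # dot product of q with every key k_i
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ V                  # convex combination of the value vectors

def attention_matrix(Q, K, V):
    # A(Q, K, V) = softmax(Q K^T) V  -- all queries at once
    scores = Q @ K.T
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
assert np.allclose(attention_per_query(Q[0], K, V), attention_matrix(Q, K, V)[0])
```
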
16 votes · 2 answers

Why does GPT-2 Exclude the Transformer Encoder?

After looking into transformers, BERT, and GPT-2, from what I understand, GPT-2 essentially uses only the decoder part of the original transformer architecture, with masked self-attention that can only look at prior tokens. Why does GPT-2 not…
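
For context, "masked self-attention that can only look at prior tokens" amounts to forcing the attention weights for future positions to zero. A minimal NumPy sketch (illustrative, not GPT-2's actual code):

```python
import numpy as np

def causal_self_attention_weights(scores):
    """Mask out future positions so token t can only attend to tokens <= t."""
    T = scores.shape[-1]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)           # future tokens get -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(5, 5))  # query-by-key scores
W = causal_self_attention_weights(scores)
print(np.round(W, 2))   # lower-triangular (including diagonal) attention weights
```
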
11 votes · 2 answers

What are the segment embeddings and position embeddings in BERT?

The paper only mentions that the position embeddings are learned, which is different from what was done in ELMo. ELMo paper - https://arxiv.org/pdf/1802.05365.pdf BERT paper - https://arxiv.org/pdf/1810.04805.pdf
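
For reference, BERT builds its input representation by summing three learned embedding tables: token, segment (sentence A vs. sentence B), and position. A minimal PyTorch sketch with bert-base-style sizes (illustrative; the real model also applies LayerNorm and dropout to the sum):

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768    # bert-base-style sizes (illustrative)

token_emb    = nn.Embedding(vocab_size, hidden)  # WordPiece token ids
segment_emb  = nn.Embedding(2, hidden)           # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)     # learned, not sinusoidal

token_ids   = torch.tensor([[101, 7592, 102, 2088, 102]])   # illustrative ids
segment_ids = torch.tensor([[0,   0,    0,   1,    1]])     # A, A, A, B, B
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)  # 0..4

# The input to the first encoder layer is the elementwise sum of the three
x = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(x.shape)   # torch.Size([1, 5, 768])
```
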
8 votes · 1 answer

Are there transformer-based architectures that can produce fixed-length vector encodings given arbitrary-length text documents?

BERT encodes a piece of text such that each token (usually a word or word piece) in the input text maps to a vector in the encoding of the text. However, this makes the length of the encoding vary as a function of the input length of the text, which makes it more…
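
One common workaround (used, for example, in Sentence-BERT-style approaches) is to pool the variable-length token vectors into a single fixed-size vector. A sketch assuming the Hugging Face transformers library, with mean pooling as one possible choice:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumes the Hugging Face `transformers` library; mean pooling is one common
# (not the only) way to collapse variable-length output into a fixed-size vector.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def encode(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        token_vectors = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)            # ignore padding
    return (token_vectors * mask).sum(1) / mask.sum(1)       # (1, 768), fixed size

print(encode("a short sentence").shape, encode("a much, much longer sentence").shape)
```
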
7 votes · 1 answer

How to use BERT as a multi-purpose conversational AI?

I'm looking to make an NLP model that can achieve a dual purpose. One purpose is that it can hold interesting conversations (conversational AI), and the other is that it can do intent classification and even accomplish the classified task. To…
5 votes · 2 answers

Where can I find pre-trained language models in English and German?

Where can I find (more) pre-trained language models? I am especially interested in neural network-based models for English and German. I am aware only of Language Model on One Billion Word Benchmark and TF-LM: TensorFlow-based Language Modeling…
5 votes · 2 answers

What is the intermediate (dense) layer between the attention-output and encoder-output dense layers within the transformer block in the PyTorch implementation?

In PyTorch, transformer (BERT) models have an intermediate dense layer between the attention and output layers, whereas the BERT and Transformer papers just mention the attention connected directly to the output fully connected layer for the encoder just…
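
The "intermediate" layer in question is the position-wise feed-forward sub-layer from the Transformer paper; in the bert-base configuration it expands the 768-dimensional hidden state to 3072 and projects it back. A minimal PyTorch sketch (illustrative; the real block also adds a residual connection and LayerNorm):

```python
import torch
import torch.nn as nn

hidden, intermediate = 768, 3072     # bert-base: intermediate size is 4x the hidden size

class PositionWiseFFN(nn.Module):
    """The 'intermediate' + 'output' dense pair that follows attention in each block."""
    def __init__(self):
        super().__init__()
        self.intermediate = nn.Linear(hidden, intermediate)  # expand
        self.activation = nn.GELU()                          # BERT uses GELU
        self.output = nn.Linear(intermediate, hidden)        # project back down

    def forward(self, x):                 # x: (batch, seq_len, hidden)
        return self.output(self.activation(self.intermediate(x)))

x = torch.randn(1, 5, hidden)
print(PositionWiseFFN()(x).shape)         # torch.Size([1, 5, 768])
```
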
5 votes · 2 answers

Transformers: how does the decoder final layer output the desired token?

In the paper Attention Is All You Need, this section confuses me: In our model, we share the same weight matrix between the two embedding layers [in the encoding section] and the pre-softmax linear transformation [output of the decoding…
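
An illustrative sketch of the weight sharing described in that passage: the pre-softmax linear layer reuses the (transposed) embedding matrix, so the decoder's final states are scored against every vocabulary embedding and a softmax turns the scores into next-token probabilities. Sizes and names below are made up for the example:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64                       # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)        # used on the input side

def output_logits(decoder_states):
    # Pre-softmax linear transform shares weights with the embedding matrix:
    # each score is a dot product with one token's embedding.
    return decoder_states @ embedding.weight.T        # (batch, seq, vocab_size)

decoder_states = torch.randn(1, 7, d_model)           # final decoder layer output
probs = torch.softmax(output_logits(decoder_states), dim=-1)
next_token = probs[0, -1].argmax()                    # highest-probability next token
print(probs.shape, next_token)
```
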
4 votes · 1 answer

Is there a pretrained (NLP) transformer that uses subword n-gram embeddings for tokenization like fasttext?

I know of several tokenization methods that are used for transformer models, like WordPiece for BERT and BPE for RoBERTa, among others. What I was wondering is whether there is also a transformer which uses a method for tokenization similar to the embeddings…
4 votes · 1 answer

Why aren't the BERT layers frozen during fine-tuning tasks?

During transfer learning in computer vision, I've seen that the layers of the base model are frozen if the images aren't too different from those the base model was trained on. However, on the NLP side, I see that the layers of the BERT…
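
Mechanically, freezing BERT's encoder is as easy as in the vision case; the question is why it is usually not done. A sketch assuming the Hugging Face transformers library:

```python
from transformers import BertForSequenceClassification

# Assumes the Hugging Face `transformers` library. Freezing the encoder is
# possible; it is just not the usual recipe for BERT fine-tuning.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

for param in model.bert.parameters():    # freeze every embedding/encoder weight
    param.requires_grad = False

# Only the small classification head on top remains trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```
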
4 votes · 1 answer

How to fine tune BERT for question answering?

I wish to train two domain-specific models. Domain 1: the Constitution and related legal documents. Domain 2: technical and related documents. For Domain 1, I have access to a text corpus with texts from the constitution and no question-context-answer…
4 votes · 1 answer

Will the BERT embedding always be the same for a given document when used as a feature extractor?

When we use BERT embeddings for a classification task, would we get different embeddings every time we pass the same text through the BERT architecture? If so, is this the right way to use the embeddings as features? Ideally, while using any feature…
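
A sketch assuming the Hugging Face transformers library: with the model in eval mode (dropout disabled) and fixed pretrained weights, repeated passes of the same text produce the same features, which is what one would expect from a feature extractor:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumes the Hugging Face `transformers` library.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()   # .eval() disables dropout

inputs = tokenizer("the same document, passed twice", return_tensors="pt")
with torch.no_grad():
    first = model(**inputs).last_hidden_state
    second = model(**inputs).last_hidden_state

# Identical features on every pass (up to the backend's numerical determinism)
print(torch.allclose(first, second))
```
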
4 votes · 1 answer

Why does the BERT encoder have an intermediate layer between the attention and neural network layers with a bigger output?

I am reading the BERT paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. As I look at the attention mechanism, I don't understand why in the BERT encoder we have an intermediate layer between the attention and…