I need some help with continuing the pre-training of BERT. I'm working with a very specific vocabulary and lots of domain-specific abbreviations, and I want to do an STS (semantic textual similarity) task. To be precise: I have domain-specific sentences and want to score pairs of them by their semantic similarity. But since very uncommon language is used here, I need to train BERT on it first.
- How does one continue the pre-training? (I read Google's GitHub release about it, but don't really understand it.) Are there any examples? I've sketched my current attempt below this list.
- What structure does my training data need to have so that BERT can understand it?
- Maybe training BERT from scratch would be even better? I assume it's the same process as continuing the pre-training, just with a different starting checkpoint (or none at all). Is that correct?
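As far as I can tell from the README of Google's repo, you'd run create_pretraining_data.py on your corpus and then run_pretraining.py with --init_checkpoint pointing at the released checkpoint, and the input corpus is plain text with one sentence per line and blank lines between documents (the blank lines matter for the next-sentence-prediction objective). I find the Hugging Face transformers route easier to follow, so here is a minimal sketch of my current attempt at continued MLM pre-training. The file name domain_corpus.txt, the output directory, and all hyperparameters are placeholders I made up, so take this as an assumption-laden sketch, not a recipe:

```python
# Minimal sketch: continue masked-language-model pre-training of BERT
# on a domain corpus using Hugging Face transformers.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the released checkpoint instead of random weights;
# this is the "continue pre-training" part.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# domain_corpus.txt (placeholder name): plain text, one sentence per line.
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# The collator applies the random 15% token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="bert-domain-adapted",  # placeholder path
    num_train_epochs=3,                # made-up hyperparameters
    per_device_train_batch_size=16,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("bert-domain-adapted")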
Also, I'd be very happy about any other tips you have.
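For the STS step itself, my plan after the domain adaptation is to fine-tune on labeled sentence pairs with the sentence-transformers library. Here is a rough sketch of what I mean; the example pairs and scores are invented, and the checkpoint path assumes the output of the script above:

```python
# Rough sketch: fine-tune the domain-adapted BERT for STS
# with sentence-transformers (cosine-similarity regression on labeled pairs).
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# Wrap the adapted BERT with a mean-pooling layer to get sentence embeddings.
word_embedding = models.Transformer("bert-domain-adapted", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Labeled pairs with similarity scores normalized to [0, 1] (invented examples).
train_examples = [
    InputExample(texts=["abbrv FOO in context", "FOO spelled out"], label=0.9),
    InputExample(texts=["unrelated sentence", "another topic entirely"], label=0.1),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)

# Afterwards, similarity is the cosine between the two sentence embeddings.
embeddings = model.encode(["sentence a", "sentence b"])
```

Does that two-stage setup (continued MLM pre-training, then STS fine-tuning) sound right, or am I missing something?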