
I need some help with continuing the pre-training of BERT. I have a very specific vocabulary and lots of domain-specific abbreviations at hand, and I want to do an STS (semantic textual similarity) task. To be precise: I have domain-specific sentences and want to pair them up by their semantic similarity. But since the language used here is very uncommon, I need to train BERT on it first.

  • How does one continue the pre-training? (I read the GitHub release from Google about it, but I don't really understand it.) Any examples? (A rough sketch follows after this list.)
  • What structure does my training data need to have so that BERT can understand it?
  • Maybe training BERT from scratch would be even better. I guess it's the same process as continuing the pre-training, just with a different starting checkpoint. Is that correct?
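For concreteness, here is a minimal sketch of what continued pre-training (domain-adaptive masked-language-model training) can look like. It uses the Hugging Face Transformers library rather than Google's original TensorFlow scripts (create_pretraining_data.py / run_pretraining.py, which, as far as I understand, expect a plain-text file with one sentence per line and blank lines between documents). All file names and hyperparameters below are placeholders, not recommendations:

```python
# Minimal sketch: continue pre-training BERT with the masked-LM objective
# on a domain-specific corpus. "domain_corpus.txt" is a placeholder for a
# plain-text file with one sentence (or short passage) per line.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

# Starting from an existing checkpoint is what makes this "continued" pre-training.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Randomly masks 15% of tokens, as in the original BERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-domain-adapted",   # placeholder output path
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
model.save_pretrained("bert-domain-adapted")
tokenizer.save_pretrained("bert-domain-adapted")
```

Training from scratch would mean initializing the same architecture randomly instead of loading the checkpoint (and, with a very unusual vocabulary, probably also training a new WordPiece tokenizer), which needs far more data and compute than continuing from a pre-trained model.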

Also, I'd be very happy about any other tips from you guys.
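And since the end goal is STS, a simple baseline on top of the adapted checkpoint would be mean-pooled embeddings plus cosine similarity; the path below just refers to the placeholder output of the sketch above, and a fine-tuned bi-encoder (e.g. sentence-transformers) would likely work better:

```python
# Minimal STS baseline, assuming the domain-adapted checkpoint saved above.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-domain-adapted")
model = AutoModel.from_pretrained("bert-domain-adapted")
model.eval()

def embed(sentences):
    # Run the encoder and average token embeddings, ignoring padding.
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state      # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)       # (batch, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)      # mean over real tokens only

a, b = embed(["first domain sentence", "second domain sentence"])
similarity = torch.nn.functional.cosine_similarity(a, b, dim=0)
print(similarity.item())
```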

  • I'm looking for answers to the same question. This may help: https://github.com/allenai/dont-stop-pretraining I haven't dug into the code yet, but it's all I've found so far. Did you have more luck? Cheers B – barry_normal Sep 18 '21 at 18:37

0 Answers