I need some help with continuing the pre-training of BERT. I'm working with a very specific vocabulary and lots of domain-specific abbreviations, and I want to do an STS (semantic textual similarity) task. To be precise: I have domain-specific sentences and want to score pairs of them by their semantic similarity. But since very uncommon language is used here, I need to train BERT on it first.
- How does one continue the pre-training? (I read Google's GitHub release about it, but don't really understand it.) Are there any examples? I've sketched my current attempt below this list.
- What structure does my training data need to have so that BERT can understand it?
- Maybe training BERT from scratch would be even better? I assume it's the same process as continuing the pre-training, just with a different starting checkpoint (or none at all). Is that correct?
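As far as I can tell from the README of Google's repo, you'd run create_pretraining_data.py on your corpus and then run_pretraining.py with --init_checkpoint pointing at the released checkpoint, and the input corpus is plain text with one sentence per line and blank lines between documents (the blank lines matter for the next-sentence-prediction objective). I find the Hugging Face transformers route easier to follow, so here is a minimal sketch of my current attempt at continued MLM pre-training. The file name domain_corpus.txt, the output directory, and all hyperparameters are placeholders I made up, so take this as an assumption-laden sketch, not a recipe:

```python
# Minimal sketch: continue masked-language-model pre-training of BERT
# on a domain corpus using Hugging Face transformers.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the released checkpoint instead of random weights;
# this is the "continue pre-training" part.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# domain_corpus.txt (placeholder name): plain text, one sentence per line.
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# The collator applies the random 15% token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="bert-domain-adapted",  # placeholder path
    num_train_epochs=3,                # made-up hyperparameters
    per_device_train_batch_size=16,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("bert-domain-adapted")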
Also, I'd be very happy about any other tips you have.
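For the STS step itself, my plan after the domain adaptation is to fine-tune on labeled sentence pairs with the sentence-transformers library. Here is a rough sketch of what I mean; the example pairs and scores are invented, and the checkpoint path assumes the output of the script above:

```python
# Rough sketch: fine-tune the domain-adapted BERT for STS
# with sentence-transformers (cosine-similarity regression on labeled pairs).
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# Wrap the adapted BERT with a mean-pooling layer to get sentence embeddings.
word_embedding = models.Transformer("bert-domain-adapted", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Labeled pairs with similarity scores normalized to [0, 1] (invented examples).
train_examples = [
    InputExample(texts=["abbrv FOO in context", "FOO spelled out"], label=0.9),
    InputExample(texts=["unrelated sentence", "another topic entirely"], label=0.1),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)

# Afterwards, similarity is the cosine between the two sentence embeddings.
embeddings = model.encode(["sentence a", "sentence b"])
```

Does that two-stage setup (continued MLM pre-training, then STS fine-tuning) sound right, or am I missing something?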