I was wondering about SciBERT's QA abilities when fine-tuned on SQuAD. I have a scarce textual dataset of fewer than 100 files in which doctors discuss cancer in dialogues. I want to add its vocabulary to SciBERT to see whether QA performance improves in the cancer domain.
After concatenating the files into one large file, which will serve as the source of our vocabulary, I clean the file (lowercasing, whitespace splitting, character filtering, punctuation removal, stopword filtering, dropping short tokens, etc.), which leaves me with a list of about 3,000 unique tokens.
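For reference, this is roughly what the cleaning step looks like (a minimal sketch; the file name `corpus.txt`, the minimum token length of 3, and the NLTK stopword list are my own assumptions for illustration):

```python
import string
from nltk.corpus import stopwords  # assumes nltk.download("stopwords") was run

def build_vocab(path="corpus.txt", min_len=3):  # hypothetical file name / threshold
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()                 # all chars to lower
    tokens = text.split()                       # whitespace splitting
    # strip punctuation and keep alphabetic tokens only (char filtering)
    table = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(table) for t in tokens]
    tokens = [t for t in tokens if t.isalpha()]
    # stopword filtering and short-token removal
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop and len(t) >= min_len]
    return sorted(set(tokens))                  # ~3,000 unique tokens in my case

myList = build_vocab()
```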
If I wanted to add these tokens, do I just call scibert_tokenizer.add_tokens(myList), where myList is the list of 3k tokens?
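In case it helps, here is the full sequence I'm using. As far as I understand the transformers API, add_tokens only extends the tokenizer, so the model's embedding matrix also has to be resized to match (the checkpoint name is the standard SciBERT one; the rest is a sketch):

```python
from transformers import AutoModel, AutoTokenizer

scibert_tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
scibert_model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

num_added = scibert_tokenizer.add_tokens(myList)  # myList: the ~3k domain tokens
print(f"Added {num_added} tokens")

# The new tokens get randomly initialized embedding rows; without this call
# the model would index out of range on the new token ids.
scibert_model.resize_token_embeddings(len(scibert_tokenizer))
```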
I can confirm that more tokens are added by checking print(len(scibert_tokenizer)), and I can see that the tokenization changes: for example, corona and ##virus now become the single token coronavirus.
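As a concrete sanity check (coronavirus here is just one example of the new tokens):

```python
print(len(scibert_tokenizer))  # vocab size grows by the number of added tokens

# Before add_tokens: ['corona', '##virus']
# After add_tokens:  ['coronavirus']
print(scibert_tokenizer.tokenize("coronavirus"))
```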
Does the model need to be trained from scratch again?