I was wondering about SciBERT's QA abilities when fine-tuned on SQuAD. I have a scarce textual dataset of fewer than 100 files in which doctors discuss cancer in dialogues. I want to add its vocabulary to SciBERT to see whether QA performance improves in the cancer domain.
After concatenating the files into one large file, which will serve as the source of our vocabulary, I clean the file (lowercasing, whitespace splitting, character filtering, punctuation removal, stopword filtering, dropping short tokens, etc.), which leaves me with a list of about 3,000 unique tokens.
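For reference, this is roughly what the cleaning step looks like (a minimal sketch; the file name `corpus.txt`, the minimum token length of 3, and the NLTK stopword list are my own assumptions for illustration):

```python
import string
from nltk.corpus import stopwords  # assumes nltk.download("stopwords") was run

def build_vocab(path="corpus.txt", min_len=3):  # hypothetical file name / threshold
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()                 # all chars to lower
    tokens = text.split()                       # whitespace splitting
    # strip punctuation and keep alphabetic tokens only (char filtering)
    table = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(table) for t in tokens]
    tokens = [t for t in tokens if t.isalpha()]
    # stopword filtering and short-token removal
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop and len(t) >= min_len]
    return sorted(set(tokens))                  # ~3,000 unique tokens in my case

myList = build_vocab()
```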
If I wanted to add these tokens, do I just call scibert_tokenizer.add_tokens(myList), where myList is the list of 3k tokens?
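In case it helps, here is the full sequence I'm using. As far as I understand the transformers API, add_tokens only extends the tokenizer, so the model's embedding matrix also has to be resized to match (the checkpoint name is the standard SciBERT one; the rest is a sketch):

```python
from transformers import AutoModel, AutoTokenizer

scibert_tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
scibert_model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

num_added = scibert_tokenizer.add_tokens(myList)  # myList: the ~3k domain tokens
print(f"Added {num_added} tokens")

# The new tokens get randomly initialized embedding rows; without this call
# the model would index out of range on the new token ids.
scibert_model.resize_token_embeddings(len(scibert_tokenizer))
```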
I can confirm that more tokens are added by checking print(len(scibert_tokenizer)), and I can see that the tokenization changes: for example, corona and ##virus now become the single token coronavirus.
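As a concrete sanity check (coronavirus here is just one example of the new tokens):

```python
print(len(scibert_tokenizer))  # vocab size grows by the number of added tokens

# Before add_tokens: ['corona', '##virus']
# After add_tokens:  ['coronavirus']
print(scibert_tokenizer.tokenize("coronavirus"))
```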
Does the model need to be trained from scratch again?