
I know that several tokenization methods are used for transformer models, like WordPiece for BERT and BPE for RoBERTa, among others. What I was wondering is whether there is also a transformer that uses a tokenization method similar to the embeddings used in the fastText library, i.e. word embeddings built by summing the embeddings of the character n-grams the words are made of.
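For concreteness, this is roughly what I mean by the fastText approach. It is only a minimal sketch with made-up embedding vectors; real fastText hashes the n-grams into a fixed number of buckets and also adds a vector for the whole word, and the names `char_ngrams` and `word_vector` are just illustrative:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers, fastText-style."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams

# Toy embedding table: one random vector per n-gram seen so far.
dim = 8
rng = np.random.default_rng(0)
ngram_vectors = {}

def word_vector(word):
    """Word embedding = sum of the embeddings of its character n-grams."""
    vec = np.zeros(dim)
    for g in char_ngrams(word):
        if g not in ngram_vectors:
            ngram_vectors[g] = rng.normal(size=dim)
        vec += ngram_vectors[g]
    return vec

print(char_ngrams("where", 3, 3))   # ['<wh', 'whe', 'her', 'ere', 're>']
print(word_vector("where").shape)   # (8,) -- could in principle be fed to a transformer
```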

To me it seems odd that this way of creating word(piece) embeddings, which could serve as the input of a transformer, isn't used in these new transformer architectures. Is there a reason why this hasn't been tried yet? Or is this question just a result of my inability to find the right papers/repos?

Michiel

1 Answer


There is a pre-trained sequence-to-sequence language model called ProphetNet, which is trained with a novel self-supervised objective called future n-gram prediction.

https://github.com/microsoft/ProphetNet

There are also a few variants on the Hugging Face model hub: https://huggingface.co/models?search=ProphetNet
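For example, assuming you have the transformers library installed, one of those checkpoints can be loaded like this (the checkpoint name below is one listed on the hub, and the input sentence and generation length are just illustrative):

```python
from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration

# Load the uncased ProphetNet checkpoint from the Hugging Face hub.
tokenizer = ProphetNetTokenizer.from_pretrained("microsoft/prophetnet-large-uncased")
model = ProphetNetForConditionalGeneration.from_pretrained("microsoft/prophetnet-large-uncased")

# Encode an example sentence and generate a short output sequence.
inputs = tokenizer("ProphetNet predicts future n-grams.", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```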

usct01