
Is there any particular reason that the most recent and successful large language models, such as GPT-3 or BLOOM, use a vanilla Transformer architecture instead of an arguably superior long-sequence architecture such as Transformer-XL, Longformer, or BigBird?
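
To make the distinction I mean concrete, here is a rough NumPy sketch (purely illustrative, not any model's actual code) of why vanilla full self-attention scales quadratically in sequence length while a Longformer-style sliding-window pattern scales roughly linearly:

```python
import numpy as np

def full_attention_mask(n):
    # Vanilla Transformer: every token attends to every other token,
    # so the attention cost scales as O(n^2) in sequence length n.
    return np.ones((n, n), dtype=bool)

def sliding_window_mask(n, w):
    # Longformer-style local attention: each token attends only to
    # tokens within a window of width w, giving roughly O(n * w) cost.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

n, w = 8, 4
print(full_attention_mask(n).sum())     # 64 attended pairs (n * n)
print(sliding_window_mask(n, w).sum())  # 34 attended pairs (about n * w)
```

The real models add further tricks (global tokens in Longformer, random blocks in BigBird, segment recurrence in Transformer-XL), but the cost asymmetry above is the core of what I am asking about.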

If you have any ideas or insights, please let me know.
