Is there a particular reason why the most recent and successful large language models, such as GPT-3 or BLOOM, use a vanilla Transformer architecture instead of an arguably superior long-sequence architecture like Transformer-XL, Longformer, or BigBird?
If you have any ideas or insights, please let me know.