
Let’s imagine two different use cases for an LLM such as GPT-3.

  1. Predicting the next most likely word in a sequence using all ~50k words in its dictionary (i.e. the standard method of prompting a LLM)
  2. Checking whether "Word-1" is more likely than "Word-2" to be next in a sequence

How much more computationally efficient is #2?

My understanding is that the cost of the attention mechanism depends on the length of the prompt (so it will be the same in both cases) and accounts for most of the computation needed to produce the output (though to what extent, I'm not sure). The difference would be in the decoding stage.

Would the matrix multiplication in the decoding step be the only computation that benefits from using a small 2-row matrix instead of the full 50k-row matrix, or are there other efficiency gains elsewhere?
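To make the question concrete, here's a rough NumPy sketch of just the final decoding step, assuming GPT-2-style dimensions (~50k vocabulary, hidden size 768 — both assumptions for illustration, and the token ids for "Word-1"/"Word-2" are made up):

```python
import numpy as np

# Assumed GPT-2-ish dimensions, for illustration only
vocab_size, hidden = 50_257, 768
rng = np.random.default_rng(0)

W_unembed = rng.standard_normal((vocab_size, hidden))  # output projection (unembedding)
h = rng.standard_normal(hidden)                        # final hidden state of the last token

# Case 1: full decode -- one (vocab_size x hidden) mat-vec
full_logits = W_unembed @ h
full_flops = vocab_size * hidden * 2  # one multiply + one add per weight

# Case 2: compare only two candidate words -- two dot products
w1, w2 = 123, 456  # hypothetical token ids for "Word-1" and "Word-2"
pair_logits = W_unembed[[w1, w2]] @ h
pair_flops = 2 * hidden * 2

print(f"full decode FLOPs: {full_flops:,}")
print(f"two-word FLOPs:    {pair_flops:,}")
print(f"ratio: ~{full_flops // pair_flops:,}x fewer in this step")

# Both approaches give the same answer to "is Word-1 more likely than Word-2?",
# since softmax is monotonic in the logits.
assert (pair_logits[0] > pair_logits[1]) == (full_logits[w1] > full_logits[w2])
```

So restricting to two words saves roughly a vocab_size/2 factor in this one mat-vec, but the transformer layers still run identically on the whole prompt either way.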

Derek

0 Answers