
Let’s imagine two different use cases for an LLM such as GPT-3.

  1. Predicting the next most likely word in a sequence using all ~50k words in its dictionary (i.e. the standard method of prompting a LLM)
  2. Checking whether "Word-1" is more likely than "Word-2" to be next in a sequence

How much more computationally efficient is #2?

My understanding is that the cost of the attention mechanism depends on the length of the prompt (so it will be the same in both cases) and accounts for most of the computation needed to produce the output (though to what extent, I'm not sure). The difference would be in the decoding stage.

Would the matrix multiplication in the decoding step be the only computation that benefits from using a small 2-row matrix instead of the full 50k-row matrix, or are there other efficiency gains elsewhere?
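To make the question concrete, here's a rough NumPy sketch of just the final decoding step, assuming GPT-2-style dimensions (~50k vocabulary, hidden size 768 — both assumptions for illustration, and the token ids for "Word-1"/"Word-2" are made up):

```python
import numpy as np

# Assumed GPT-2-ish dimensions, for illustration only
vocab_size, hidden = 50_257, 768
rng = np.random.default_rng(0)

W_unembed = rng.standard_normal((vocab_size, hidden))  # output projection (unembedding)
h = rng.standard_normal(hidden)                        # final hidden state of the last token

# Case 1: full decode -- one (vocab_size x hidden) mat-vec
full_logits = W_unembed @ h
full_flops = vocab_size * hidden * 2  # one multiply + one add per weight

# Case 2: compare only two candidate words -- two dot products
w1, w2 = 123, 456  # hypothetical token ids for "Word-1" and "Word-2"
pair_logits = W_unembed[[w1, w2]] @ h
pair_flops = 2 * hidden * 2

print(f"full decode FLOPs: {full_flops:,}")
print(f"two-word FLOPs:    {pair_flops:,}")
print(f"ratio: ~{full_flops // pair_flops:,}x fewer in this step")

# Both approaches give the same answer to "is Word-1 more likely than Word-2?",
# since softmax is monotonic in the logits.
assert (pair_logits[0] > pair_logits[1]) == (full_logits[w1] > full_logits[w2])
```

So restricting to two words saves roughly a vocab_size/2 factor in this one mat-vec, but the transformer layers still run identically on the whole prompt either way.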

Derek

0 Answers