
When ChatGPT is generating an answer to my question, it generates it word by word.
So I actually have to wait until I get the final answer.

Is this just for show?
Or is it really generating the answer in real time, word by word, not yet knowing what the next word will be?

Why does it not give the complete answer text all at once?

    It is just for show. – Dr. Snoopy Jan 27 '23 at 15:54
  • It also does in fact generate the output one token at a time, not "knowing" what the next word will be, because each token added to the sequence changes the probabilities for further tokens, and this process is necessarily sequential. However, it is unlikely to be doing a whole round trip between the model running on a GPU and the web interface for each token; that would be very inefficient. – Neil Slater Jan 27 '23 at 20:23
    It can also be for inducing us to read it all. Should it dump it all at once, we would be more likely to skim over the answer or even quit. It helps engagement. – Jaume Oliver Lafont Mar 01 '23 at 10:00
  • @Dr.Snoopy No, it is not just for show. It literally does not know the later tokens when it generates a token, the output is real time. – Volker Siegel Apr 13 '23 at 19:49
  • @VolkerSiegel How do you know that? Any references? – Dr. Snoopy Apr 14 '23 at 10:09
  • @Dr.Snoopy If you use the API, you can specify `stream=True` if you do not want the server to wait until the whole response is ready, but instead have it return many small "events". A large completion can take quite a long time, so it makes a big difference whether you get the first result after a few seconds or after a minute. At the API level, it would be somewhat absurd to have a built-in delay on the server side. And the delay you see in the web interface behaves exactly like the API. https://github.com/openai/openai-cookbook/blob/main/examples/How_to_stream_completions.ipynb (see the streaming sketch after these comments) – Volker Siegel Apr 14 '23 at 19:53
  • @Dr.Snoopy Another strong argument for expecting the output to be an incremental stream like the one we see: that is how GPT-4 fundamentally works; the output is an individual token, produced repeatedly, or even a list of candidate tokens with probabilities for what the next token could be. Creating incremental output, token by token, is "just what a GPT does", fundamentally. A token is about 4 characters long on average; it can be a single character, or a whole word that occurs frequently. – Volker Siegel Apr 14 '23 at 19:59
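
For reference, here is a minimal sketch of the streaming behaviour described in the comments above, following the cookbook notebook linked there and the 2023-era `openai` Python client (pre-1.0 interface); the model name and prompt are just illustrations:

```python
# Minimal streaming sketch (openai < 1.0 client, as in the linked cookbook).
# Assumes the OPENAI_API_KEY environment variable is set.
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Count to ten."}],
    stream=True,  # ask the server to send tokens as they are generated
)
for chunk in response:
    # Each "event" carries only the delta since the previous one,
    # typically a single token.
    print(chunk.choices[0].delta.get("content", ""), end="", flush=True)
```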

2 Answers


ChatGPT is a conversational agent based on GPT-3.5, which is a causal language model. Under the hood, GPT works by predicting the next token given the input sequence of words so far. So yes, at each step a single word is generated, taking into account all the previous words.

See for instance this Hugging Face tutorial.
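
As an illustration, here is a minimal sketch of that token-by-token loop, using the Hugging Face `transformers` library with GPT-2 as a stand-in (ChatGPT's weights are not public, but the autoregressive mechanism is the same in spirit); greedy decoding and the prompt are chosen purely for simplicity:

```python
# Greedy autoregressive generation, one token at a time (GPT-2 as stand-in).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The answer appears", return_tensors="pt").input_ids

for _ in range(10):  # generate 10 tokens, strictly sequentially
    with torch.no_grad():
        logits = model(input_ids).logits       # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()           # most probable next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
    # The token can be shown as soon as it is picked; nothing later
    # in the answer exists yet at this point.
    print(tokenizer.decode(int(next_id)), end="", flush=True)
```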

To further explain: while outputting entire sequences of words at once is possible in principle, it would require a huge amount of data, since the probability of any particular sequence in the space of all possible sequences is extremely small. For example, with a vocabulary of 50,000 tokens there are 50,000^20 ≈ 10^94 distinct 20-token sequences, far too many to model directly. Instead, building a probability distribution over half a million English words at each step is feasible (in reality just a tiny fraction of those words is used frequently).
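
To make that concrete, here is a small follow-on sketch (same GPT-2 stand-in and illustrative prompt as above) that inspects the per-step distribution directly; this is also what the comments under the question mean by "a list of tokens with probabilities":

```python
# Inspect the next-token probability distribution (GPT-2 as stand-in).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]   # one score per vocabulary entry
probs = torch.softmax(logits, dim=-1)   # ~50k probabilities, summing to 1
top = torch.topk(probs, k=5)
for p, i in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(i))!r}: {p.item():.3f}")  # 5 likeliest next tokens
```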

On top of that, there may be some scenic effect, to simulate the AI "typing" the answer.

Rexcirus
  • I think you may be right, given my knowledge of these models. However, ChatGPT could also generate all the text (word by word or not) before displaying anything, and then show it to the user all at once. So, in a way, this is also for show. – nbro Feb 01 '23 at 09:57
  • Although the first three paragraphs of your answer are true, I'd be surprised if they were actually the direct cause of displaying the answer one word at a time. More likely, this is just for scenic effect, as per your last sentence, and perhaps also to slow the conversation down so that the server doesn't receive too many requests at once. – Stef Mar 31 '23 at 15:29
  • @nbro That would just be collecting the output on the server side to return it as one document. The API actually does that by default; it is easier to use in programming when you do not care about latency and would otherwise need to collect the parts yourself. You specify `stream=True` in the API for incremental results (sketched after these comments). – Volker Siegel Apr 14 '23 at 20:04
  • @VolkerSiegel Yes, I've worked with their quite unreliable (slow, often unavailable, etc.) API. – nbro Apr 17 '23 at 12:07
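
For what it's worth, nbro's point above is easy to demonstrate with the same pre-1.0 `openai` client as in the earlier sketch (model name and prompt again illustrative): a client, or the server itself, can consume the stream silently and display the text only once it is complete, so "all at once" differs from streaming only in when the tokens are shown, not in how they are produced:

```python
# Collect the streamed deltas and show the answer only when complete
# (openai < 1.0 client; assumes OPENAI_API_KEY is set).
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain tokens briefly."}],
    stream=True,
)
full_text = "".join(
    chunk.choices[0].delta.get("content", "") for chunk in response
)
print(full_text)  # displayed all at once, yet generated token by token
```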

Why does ChatGPT not give the answer text all at once?

Because ChatGPT is autoregressive (i.e., it generates each new word by looking at the previous words), as Rexcirus mentioned.

Is this just for show?

On https://beta.openai.com/playground, output words/tokens are displayed faster when using smaller models such as text-curie-001 than when using larger ones such as text-davinci-003. That is, the inference time does seem to affect the display time.

https://twitter.com/ArtificialAva/status/1624411499375603715 compared the display speed of ChatGPT vs. ChatGPT Plus vs. ChatGPT Turbo Mode and showed that ChatGPT Turbo Mode displays the output more than twice as fast, which further indicates that ChatGPT shows its response word by word because of its backend (computation time + autoregressive generation).

Franck Dernoncourt
    Or maybe the chatbot is just intentionally slowing the conversation down by displaying the words slowly, to avoid the server drowning under the huge number of users. And then if you register for turbo mode then the server gives you higher priority, and displays the words faster, regardless of how fast they are actually generated by the algorithm. – Stef Mar 31 '23 at 15:26
  • @Stef No, it is all real-time output from different models; it is clearer when you use the API directly. – Volker Siegel Apr 13 '23 at 19:52
  • @VolkerSiegel Definitely, we've used the API quite a lot for https://arxiv.org/pdf/2304.05613.pdf and some other projects. – Franck Dernoncourt Apr 27 '23 at 03:14