It is just one huge model which performs autoregressive text generation.
The ability to perform a wide variety of tasks, defined at inference time, is called in-context learning and was introduced in the GPT-3 paper.
The underlying idea is that during self-supervised pretraining, the model "sees" a huge variety of sequences and tasks, and learns to recognize very high-level patterns that identify the tasks (i.e., the model can recognize a task even when it is specified with a different syntax) and how the tasks are performed. This knowledge is used at inference time to perform the correct task.
This ability is largely dependent on model size and emerges at hundreds of billions of parameters.
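To make "autoregressive text generation" concrete, here is a minimal sketch of the generation loop, assuming the Hugging Face transformers library and using the small gpt2 checkpoint purely as a stand-in (a model this small will not show the emergent abilities discussed above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("In-context learning is", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits      # scores for every next-token candidate
    next_id = logits[0, -1].argmax()    # greedy choice; sampling also works
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # feed it back in

print(tok.decode(ids[0]))
```

The key point is the loop: every new token is predicted from all previously generated tokens, which is what "autoregressive" means here.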

The task to perform is inferred from the context, i.e., the text that specifies the desired task.
The context may be composed of a description of the task, followed by
- A few examples of the desired output (few-shot approach)
- Only the structure the output needs to follow (zero-shot approach).
Both cases are illustrated in the sketch below.
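A minimal sketch of the two prompt styles, using a text-generation pipeline (the translation prompts mimic the format from the GPT-3 paper; gpt2 is again only a placeholder and is far too small to perform the task reliably):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Few-shot: task description followed by a few input/output examples.
few_shot_prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "car =>"
)

# Zero-shot: task description plus only the structure the output must follow.
zero_shot_prompt = "Translate English to French.\ncar =>"

print(generator(few_shot_prompt, max_new_tokens=5)[0]["generated_text"])
print(generator(zero_shot_prompt, max_new_tokens=5)[0]["generated_text"])
```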
(Maybe it would be easier to answer this question if there were a separate, specifically trained transformer for each supported language.)
A huge problem is that you need a specific, large dataset for each language. That means if you want to obtain similar performance across all languages, you should have a dataset of roughly the same size in each language, which is not the case for many datasets. For example, the C4 dataset has 4B English samples, but only 545M German ones. One way Large Language Models can overcome this issue is by exploiting the knowledge provided by many languages, learning to reason over this knowledge and align it with the target language.
BTW: Why is the task of "following instructions" (which InstructGPT is said to be specialized for) a task on its own? Isn't every prompt an instruction in a sense, instructing ChatGPT to perform some downstream task?
Yes, but there are many ways to answer, and there is no standard way to define what a "good" answer is, since it may be task- and context-dependent, so it would be hard to train a good model that gives high-quality responses to a wide variety of tasks.
To add these additional supervisory signals of "goodness", they used Reinforcement Learning, avoiding the need to define a specialized loss for each task (which could be intractable) and using human feedback instead. This is a very important part of ChatGPT and explains its better performance compared to GPT-3.
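As a rough illustration of how human feedback becomes a training signal without a per-task loss, here is a sketch of the pairwise preference loss used to train a reward model (the `reward_model`, embeddings, and dimensions below are all hypothetical; in InstructGPT this reward model then guides the policy via PPO):

```python
import torch
import torch.nn.functional as F

# Hypothetical reward model: maps a response representation to a scalar score.
reward_model = torch.nn.Linear(768, 1)

# Hypothetical representations of a human-preferred and a rejected response
# to the same prompt (in practice these come from the language model itself).
chosen = torch.randn(4, 768)    # batch of preferred responses
rejected = torch.randn(4, 768)  # batch of rejected responses

# Pairwise (Bradley-Terry) loss: push the reward of the preferred response
# above that of the rejected one, regardless of the underlying task.
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()  # gradients train the reward model, not a task-specific loss
```

This is why no specialized per-task loss is needed: human comparisons define "goodness" uniformly across tasks.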