4

Is it difficult for other companies to train a model similar to ChatGPT, and what makes it difficult? What is challenging about reproducing the results obtained by OpenAI with ChatGPT/GPT3.5? Would it be possible for a company like Meta or Google to have a model equal to ChatGPT/GPT3.5 in the next month or so? Why or why not?

I understand that a large language model is expensive to train, so I expect only large companies to be able to train such models to a sufficient extent.

Robin van Hoorn
  • Can you provide more details about "oh its challenging Google" and about the articles and videos that claim that? It's challenging, but most likely not because they don't have the computational power; the other claimed reasons are probably not true. – nbro Jan 25 '23 at 09:28
  • I'm not interested in whether the articles or videos are true, only in whether it is difficult for others to replicate their work and what the reasons for that are. I removed the whole 'videos and articles' part. – Robin van Hoorn Jan 25 '23 at 10:13
  • Ok. I think the question in the body is more objective now. I would also change the title then. Another thing: if you focus on ChatGPT, then people could simply answer by saying that it's difficult because there's still no research paper available. Maybe you're interested just in ChatGPT or maybe in GPT-3. I don't know. Make it clear whether you're interested just in Google, in other companies, or even in "normal people". – nbro Jan 25 '23 at 10:47
  • Generalized it further ;) Thanks for the input – Robin van Hoorn Jan 25 '23 at 10:55
  • 3
    Are you aware that Google made Lamda? https://blog.google/technology/ai/lamda/ – Dr. Snoopy Jan 25 '23 at 12:46
  • Another thing to note is that the transformer model was developed by Google Research. – nbro Jan 25 '23 at 14:04
  • @nbro and attention in RNNs was invented by academic researchers, and positional encodings in RNNs by FB (AFAIK). Giving all the credit to one team is ridiculous. – Mariah Jan 31 '23 at 21:57
  • @Mariah You didn't understand the point of that comment. The point of that comment was: if Google researchers came up with the transformer, then they are familiar with the transformer and its potential. GPTs are nothing special. They just received more hype than other pre-trained models so that OpenAI gets more money. Google doesn't have to generate this hype to get the money, but they also developed other pre-trained models. – nbro Feb 01 '23 at 08:35

2 Answers

8

Challenges in reproducing ChatGPT:

  • Compute cost
  • Collecting training data
  • Finding the right network architecture + RL setup (OpenAI hasn't published all the details)

On training cost vs. LLM quality: for some tasks, "smaller LLMs" can perform well, e.g. see Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, Tatsunori B. Hashimoto. Benchmarking Large Language Models for News Summarization. arXiv:2301.13848:

We find instruction tuning, and not model size, is the key to the LLM’s zero-shot summarization capability
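To make the compute-cost point concrete, here is a back-of-envelope sketch using the common heuristic that training takes roughly 6·N·D FLOPs (N = parameters, D = training tokens). The parameter and token counts are GPT-3-scale figures from the GPT-3 paper; the GPU throughput, price, and utilization numbers are illustrative assumptions, not figures published by OpenAI.

```python
# Back-of-envelope estimate of LLM training cost, using the common
# ~6 * N * D training-FLOPs heuristic (N = parameters, D = tokens).
# GPU throughput, hourly price, and utilization are illustrative
# assumptions, not published figures.

def training_cost_usd(n_params, n_tokens, flops_per_gpu_per_s,
                      gpu_cost_per_hour, utilization=0.3):
    """Estimate the dollar cost of one training run."""
    total_flops = 6 * n_params * n_tokens          # heuristic total compute
    effective_rate = flops_per_gpu_per_s * utilization
    gpu_hours = total_flops / effective_rate / 3600
    return gpu_hours * gpu_cost_per_hour

# GPT-3-scale example: 175B params, 300B tokens (GPT-3 paper),
# assuming an A100-class GPU (~312 TFLOP/s peak) at ~$2/GPU-hour.
cost = training_cost_usd(175e9, 300e9, 312e12, 2.0)
print(f"~${cost / 1e6:.1f}M for one training run")
```

With these assumptions the estimate lands around $2M for a single run, consistent with the "several million dollars" figure mentioned in the comments once retries, tuning, and personnel are added on top.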

Franck Dernoncourt
  • 1
    Indeed, compute cost is not trivial! Some estimates place the cost of a single training run at several million dollars, to say nothing of personnel costs, data-gathering costs, and the cost of any hyper-parameter tuning or architecture search. – Sycorax Feb 21 '23 at 23:52
  • Also, the human cost of having someone manually label the data. – Anshuman Kumar Mar 29 '23 at 09:07
  • @AnshumanKumar Yes, I was including that in "collecting training data". There are some open-access datasets, e.g. C4 + https://github.com/nomic-ai/gpt4all – Franck Dernoncourt Mar 29 '23 at 09:15
2

Actually, Google created a bigger model than GPT-3 and the models in the GPT-3.5 series, and consequently ChatGPT too (because ChatGPT is based on a GPT-3.5 model): Switch-C has over a trillion parameters, an order of magnitude more than the GPT models that I know of, and it was developed before ChatGPT was announced. I don't know exactly how many parameters ChatGPT has, but it shouldn't have more than several billion.
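To put the parameter counts above in perspective, a quick sketch of the raw weight storage they imply, assuming 2 bytes per parameter (fp16/bf16). Note that Switch-C is a sparse mixture-of-experts model, so only a fraction of its parameters is active per token; total parameter count alone doesn't determine quality or serving cost.

```python
# Rough weight-storage footprint implied by a parameter count,
# assuming 2 bytes per parameter (fp16/bf16), weights only.

def weights_gb(n_params, bytes_per_param=2):
    """Gigabytes needed just to store the model weights."""
    return n_params * bytes_per_param / 1e9

print(f"175B-param model (GPT-3 scale): ~{weights_gb(175e9):.0f} GB")
print(f"1.6T-param model (Switch-C):    ~{weights_gb(1.6e12):.0f} GB")
```

This comes out to roughly 350 GB vs. 3200 GB of weights, which is why either model already requires many accelerators just to serve, let alone train.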

So, what makes reproducing a model like ChatGPT difficult for companies like Google? Definitely not the lack of computational resources or money, but the lack of transparency. My impression is that Google also tends to be open-source, as opposed to OpenAI, which wants to make money off everything.

Moreover, I'd like to note that the GPT models have received a lot of hype, but there are other pre-trained models developed by Google, for example (e.g. LaMDA or Switch-C), that perhaps also deserve our attention. Google simply doesn't need to generate all this hype to get money, as they still get most of their revenue from ads (last time I checked).

nbro
  • There is something I'm not sure of, and I can't find an answer even in OpenAI's blog. Is GPT-3.5 (the series) a completely separate model that has been trained from scratch, or is it a fine-tuned version of GPT-3 for code completion and conversations? – iMad Feb 02 '23 at 13:13
  • 1
    @iMad The only reliable info that I found about this topic was here: https://platform.openai.com/docs/model-index-for-researchers. Maybe it's not accurate to say that OpenAI claims that _all_ models in this series are more capable. I should rewrite that part of my answer – nbro Feb 02 '23 at 13:39
  • Actually I wasn't referring to your answer specifically, it's just that I can't find any reliable information about how GPT-3.5 has been trained, its differences with GPT-3 etc. In OpenAI's docs, it's mentioned that InstructGPT is somehow a fine-tuned version of GPT-3, and that ChatGPT was trained in the same way as InstructGPT, but no explicit claim about the relationship between the GPT-3 base model and the GPT-3.5 series. – iMad Feb 02 '23 at 16:24
  • I think you need to back up that middle paragraph. AFAIK, OpenAI publish papers on their models, and explain how ChatGPT was trained here - https://openai.com/blog/chatgpt/ (also see links to their publications on Arxiv). Google's LLMs are not open source. So I am not seeing a difference in transparency between the two companies regarding these kind of products, unless you can provide more information – Neil Slater Feb 22 '23 at 09:25
  • @NeilSlater I'm saying that my impression is that Google tends to be open-source, while OpenAI doesn't. My impression may not be aligned with reality. I'm not saying that Google is transparent about everything or more transparent than OpenAI about anything. I'm saying that what makes reproducing a model like ChatGPT difficult is the lack of transparency. I'm also not saying that there isn't info about ChatGPT. I'm saying there isn't enough info about everything in order to reproduce it exactly. – nbro Feb 22 '23 at 11:11
  • @NeilSlater OpenAI also has open-source code, but making money off a GPT-3 model for every character that you pass to the model or that the model produces... while you have used Google Translate and Google Search for free for years... To me, that's a clear sign of the mentality and political views/approaches of a company. I understand that OpenAI doesn't have the luxury of ads that Google has, but I still don't agree with their financial/political policy. – nbro Feb 22 '23 at 11:14
  • @NeilSlater Also note, the original question was "Can (e.g.) Google easily train a ChatGPT-like model?", so clearly the OP was interested in knowing whether a specific company like Google can do it, not whether every company can, which is what the other answer addresses. But then the question was changed, and that's fine. – nbro Feb 22 '23 at 11:22
  • From your comment, then I think the second paragraph is muddled. It reads - to me - as a comparison on transparency and business plan between OpenAI and Google due to mentioning lack of transparency as a cause then a comparison on code release practices and revenue sources in the next sentence. I also see no evidence of a "lack of transparency" preventing Google doing anything in this space - they have LaMDA for example. – Neil Slater Feb 22 '23 at 14:43
  • The question was about "reproducing", not doing something similar. So, the main reason why one _cannot_ reproduce something is usually the lack of transparency and details. For Google, definitely not the compute cost, as the other answer suggests. Whether Google wants to reproduce the model or not is another story. They could in theory do it, given enough details of how the model was built, because they have the money. – nbro Feb 22 '23 at 15:36
  • Anyone who has tried to reproduce a paper, like me, knows that the lack of details/transparency is usually the main reason for not being able to reproduce something. Some authors even try to specify all the details, but often, for different reasons, some are still missing. – nbro Feb 22 '23 at 15:46
  • @nbro [Why do so many publishing venues limit the length of paper submissions?](https://academia.stackexchange.com/q/57071/452) doesn't help reproducibility. – Franck Dernoncourt Feb 22 '23 at 20:24
  • @FranckDernoncourt Yes, that's one reason. But I think there could always be a longer version that circulates around, unless the journal or publisher prohibits it. Some already do that. But if we look at the InstructGPT or GPT-3 papers, they are already very long... – nbro Feb 22 '23 at 22:22