
To me it looks like GPT-4 is based on GPT-3.

On the other hand, there were rumors that the training of GPT-3 was done with errors, but that retraining was impossible due to the cost.


2 Answers


GPT-4 is largely based on GPT-3. From the GPT-4 Technical Report:

GPT-4 is a Transformer-style model [39]

The Transformer architecture originates from the paper Attention Is All You Need, which laid the foundation for GPT, GPT-2, and GPT-3.
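For context, here is a minimal sketch of the decoder-only Transformer block this family of models is built from. The sizes (`d_model=768`, `n_heads=12`) are placeholders on the scale of GPT-2 small, not the real GPT-3/GPT-4 values, and many details (dropout, initialization, layer-norm placement variants, weight tying) are omitted.

```python
# Minimal sketch of one decoder-only Transformer block, the building block shared
# by the GPT family. Simplified: sizes are placeholders, and real models add
# dropout, careful initialization, and other details.
import torch
import torch.nn as nn

class GPTBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: each position may only attend to itself and earlier positions.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a                      # residual connection around attention
        x = x + self.mlp(self.ln2(x))  # residual connection around the MLP
        return x

x = torch.randn(1, 16, 768)            # (batch, tokens, embedding dim)
print(GPTBlock()(x).shape)             # torch.Size([1, 16, 768])
```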

However, there is one significant change: GPT-4 accepts images as inputs. Combining images and text was previously explored by OpenAI in the paper Learning Transferable Visual Models From Natural Language Supervision (the CLIP paper). We can reasonably predict that GPT-4 is based on GPT-3 + CLIP (a rough sketch of that idea appears below).

I say predict because the authors decided not to publish the architecture or any other details:

Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

However, we have no details on whether it was trained from scratch or whether it reuses any components or techniques from earlier models.
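Since the architecture is not published, the following is pure speculation in the spirit of the "GPT-3 + CLIP" guess above: one common way to add image inputs to a text-only decoder is to project image features into the token embedding space and prepend them as extra "tokens". All names and sizes here (`clip_image_encoder`, `d_image`, `d_model`) are hypothetical placeholders, not OpenAI's actual design.

```python
# Hypothetical sketch of "GPT-3 + CLIP"-style image conditioning: project image
# features into the decoder's embedding space and prepend them to the text tokens.
# Every name and size below is an assumption made for illustration.
import torch
import torch.nn as nn

d_image, d_model = 512, 768            # assumed feature sizes
project = nn.Linear(d_image, d_model)  # maps image features into token space

def build_input(image, token_ids, clip_image_encoder, token_embedding):
    img_feats = clip_image_encoder(image)      # (batch, n_patches, d_image)
    img_tokens = project(img_feats)            # (batch, n_patches, d_model)
    txt_tokens = token_embedding(token_ids)    # (batch, n_text, d_model)
    # The decoder then attends over the concatenated sequence of image and text tokens.
    return torch.cat([img_tokens, txt_tokens], dim=1)
```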

– Minh-Long Luu
  • But did they re-train it from scratch or not? – Anixx Mar 19 '23 at 11:14
  • `this report contains no further details about [...] training method` – Minh-Long Luu Mar 19 '23 at 11:35
  • @Anixx As I already said in the removed answer, GPT-4 is just a retrained GPT-3 with more fine-tuning. It's a matter of what has changed in the training data set. For example, it seems that the previously mentioned *`glitch tokens`* have now been removed, which is likely an effect of removing dumb scraping data, such as from the *counting* sub-reddit, or from scraping random debug log files, etc. In other words, the training is done from scratch, but it's likely to be done on mostly the same data, adjusted to remove problem points discovered by humans in the previous version. – not2qubit Mar 19 '23 at 12:42
  • @not2qubit *glitch tokens* are an effect of having tokens in the token set which were rarely or never in the training data, because they come from spam-like data that was removed before training but after the token set was selected. They can be removed in the token splitter that converts text into tokens, without needing to re-train. – user253751 Mar 21 '23 at 22:00
  • @not2qubit `GPT-4 is just a retrained GPT-3 with more fine tuning` is this what you *think*, or is it something the authors say? Why does removing the glitch tokens imply that GPT-4 is just a fine-tuned GPT-3? – Minh-Long Luu Mar 22 '23 at 04:13
  • @not2qubit The glitch tokens are not actually a problem in the transformer; they are a problem with building the set of tokens used to encode the input. I think it is enough to literally remove the glitch tokens from the set of valid tokens. I expect the set of tokens may be the same, apart from that. GPT-3 uses the same tokens GPT-2 uses. – Volker Siegel Mar 22 '23 at 14:14
  • @not2qubit it is quite possible that the training data did not change at all. The glitch tokens can be fixed without cleaning the data and retraining. – Volker Siegel Mar 22 '23 at 14:17
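The comments above argue that glitch tokens live in the tokenizer, not in the transformer weights, so they can be handled at the text-to-token boundary without retraining. A minimal sketch of that idea, assuming the Hugging Face `transformers` GPT-2 tokenizer as a stand-in and an illustrative (not exhaustive) glitch-token list:

```python
# Rough illustration of the comments' point: glitch tokens can be intercepted at
# the tokenizer level, without touching the model weights. The tokenizer choice
# and the glitch-token list below are illustrative assumptions only.
from transformers import GPT2TokenizerFast   # assumes the `transformers` package is installed

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
GLITCH_TOKENS = {" SolidGoldMagikarp", " petertodd"}   # known examples, not exhaustive

def encode_without_glitch_tokens(text: str) -> list[int]:
    ids = tokenizer.encode(text)
    safe_ids = []
    for tid in ids:
        token = tokenizer.decode([tid])
        if token in GLITCH_TOKENS:
            # Re-encode the offending string character by character so it is
            # represented by ordinary sub-tokens instead of the rare glitch token.
            for ch in token:
                safe_ids.extend(tokenizer.encode(ch))
        else:
            safe_ids.append(tid)
    return safe_ids
```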

GPT-4 is a transformer like GPT-3 and every other GPT. The training is certainly new, because the model has a different size; you just cannot transfer the GPT-3 weights into GPT-4 to continue training.
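As a toy illustration of why weights cannot simply be carried over when the sizes differ (the dimensions below are made up for the example):

```python
# Toy illustration: loading weights from a smaller layer into a larger one fails
# because the tensor shapes no longer match. The sizes are invented for the example.
import torch.nn as nn

small = nn.Linear(768, 768)    # stand-in for a layer in the smaller model
large = nn.Linear(1024, 1024)  # stand-in for a hypothetically larger layer

try:
    large.load_state_dict(small.state_dict())
except RuntimeError as e:
    print(e)   # "size mismatch for weight: copying a param with shape [768, 768] ..."
```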

The details of the implementation are currently not known. The published report about it is not a scientific paper about how GPT-4 works; it basically states "we do not say how it works", see below.

The sizes of various components certainly differ, at least the number of weights in one part, but it could also be larger everywhere. The prompt length (context size) and the maximal output size increased a lot, from 4,000 tokens to 8,000 or even 32,000 tokens. I would expect there are some other minor differences as well.

Other major differences could be the amount of training data and the compute used for training. My personal speculation is that GPT-4 used a lot more compute for training, with possibly a similar amount of training data as GPT-3.


From the "GPT-4 Technical Report", section 2:

Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

– Volker Siegel
  • Saying that GPT-3 is just a bigger GPT and GPT-2 is correct, but saying GPT-4 *is* conceptually the same as GPT-3 is, I believe, incorrect. You cannot feed an image in like text without processing steps, such as ViT's patch embedding. That makes GPT-4 differ from GPT-3, albeit slightly. – Minh-Long Luu Mar 22 '23 at 08:38
  • Yes, I added discussion of the differences. – Volker Siegel Mar 22 '23 at 14:07
  • "you just can not transfer GPT-3 weights into an GPT-4 to continue training" - actually, you could, as long as what you transfer is compatible in mathematical terms (i.e. you can still multiply the matrices/tensors). In fact, a form of transfer learning is based on the idea of freezing some weights then connecting a new compatible set of weights. It's very well possible that they are doing something like that or some form of model compression, as training from scratch may take a long time. – nbro May 04 '23 at 20:34
  • @nbro If GPT-4 were constructed as a kind of extension of GPT-3, that would be possible, and very interesting. But that would require very different networks; I think you cannot just hammer a small transformer into a larger one. We do not know what they actually did, but building GPT-4 on top of a learned smaller model sounds like big scientific progress we would have heard of. Hmm... But then, what would happen if you just took the weights you have from an old, smaller model and used them as initialization at the beginning, together with 20 times the amount of noise or zeros? – Volker Siegel May 04 '23 at 23:14
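For what it's worth, the idea floated in the last two comments (reusing a smaller model's weights as the corner of a larger weight matrix and filling the new rows and columns with zeros or small noise) could be sketched as follows; this is only an illustration of the speculation in this thread, not anything OpenAI has described.

```python
# Sketch of "growing" a weight matrix: embed the old weights in the corner of a
# larger matrix and initialize the new entries with small noise (or zeros).
# Purely illustrative; the sizes are made up.
import torch

def grow_weight(old: torch.Tensor, new_out: int, new_in: int, noise_std: float = 0.01):
    """Embed `old` (shape [out, in]) in a larger [new_out, new_in] matrix."""
    out_dim, in_dim = old.shape
    new = noise_std * torch.randn(new_out, new_in)  # small-noise init for the new slots
    new[:out_dim, :in_dim] = old                    # copy the old weights into the corner
    return new

old_w = torch.randn(768, 768)           # pretend this came from the smaller model
new_w = grow_weight(old_w, 1024, 1024)  # larger layer initialized from the old one
print(new_w.shape)                      # torch.Size([1024, 1024])
```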