OpenAI's Codex transformer model was introduced in a 2021 paper ("Evaluating Large Language Models Trained on Code"). The paper does not give complete information about the architecture. Below I've quoted all the passages in the paper that give hints about the architecture:

...we hypothesized that a specialized GPT model, called Codex, could excel at a variety of coding tasks. This paper describes several early Codex models, whose descendants power GitHub Copilot and the Codex models in the OpenAI API.

We fine-tune GPT models containing up to 12B parameters on code to produce Codex.

...we hypothesized that it would be beneficial to fine-tune from the GPT-3 (Brown et al., 2020) model family...

A table in the paper (Table 1) lists the various Codex models studied, giving only the number of parameters for each: 12M, 25M, 42M, 85M, 300M, 679M, 2.5B, and 12B.

So what we can glean from this is: the models discussed in the paper are not the production Codex model that powers Copilot, but prototypes (so the production model could in theory be completely different - that's fine, I'm not asking about that one). Several different-sized versions of the model are studied, and the only description of their architecture is that they are "GPT models".

I'm not sure if this is underspecified, or if I just don't know the field well enough. Saying a model is "a GPT model" does not seem to specify the architecture uniquely, to me. I know I can go read the GPT-3 paper for more information, but even if you tell me something is, say, a 300M-parameter "GPT-3 model", it seems to me I still don't know how many layers it has, how many attention heads, and so on.
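To make concrete why a parameter count alone doesn't seem to pin down the shape, here is a rough sketch of the usual parameter accounting for a GPT-2/GPT-3-style decoder-only transformer. The vocabulary size (50257) and context length (2048) are assumptions borrowed from GPT-3, not numbers stated in the Codex paper, and the three shapes at the bottom are hypothetical, chosen only to land near the same total:

```python
# Back-of-the-envelope parameter count for a GPT-2/GPT-3-style decoder-only
# transformer. Vocabulary size (50257) and context length (2048) are assumed
# from GPT-3; the Codex paper does not state them.

def gpt_param_count(n_layer, d_model, vocab=50257, n_ctx=2048):
    embed = vocab * d_model + n_ctx * d_model      # token + positional embeddings
    attn = 4 * d_model * d_model + 4 * d_model     # Q, K, V, output projections (+ biases)
    mlp = 8 * d_model * d_model + 5 * d_model      # 4x-wide feed-forward in and out (+ biases)
    ln = 4 * d_model                               # two layer norms per block (scale + bias)
    final_ln = 2 * d_model
    return embed + n_layer * (attn + mlp + ln) + final_ln

# Three hypothetical shapes that all land within a few percent of ~355M parameters:
for n_layer, d_model in [(24, 1024), (44, 768), (15, 1280)]:
    total = gpt_param_count(n_layer, d_model)
    print(f"{n_layer:>2} layers, d_model={d_model:<5} -> {total/1e6:.0f}M params")
```

All three of those shapes come out within a few percent of one another, which is exactly why quoting only a total parameter count leaves me unsure about depth and width.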

Can we deduce more about the shapes of these models? At least the number of layers and parameters per layer?

Jack M