For questions related to GPT (short for Generative Pre-Training), which combines the transformer architecture (proposed in "Attention Is All You Need") with unsupervised pre-training for solving language tasks such as machine translation. GPT was proposed in "Improving Language Understanding by Generative Pre-Training" (2018) by OpenAI. There is also GPT-2, which was proposed in "Language Models are Unsupervised Multitask Learners" (2019) by OpenAI.
Questions tagged [gpt]
77 questions
27
votes
4 answers
Why is ChatGPT bad at math?
As opposed to How does ChatGPT know math?, I've been seeing some things floating around the Twitterverse about how ChatGPT can actually be very bad at math. For instance, I asked it "If it takes 5 machines 5 minutes to make 5 devices, how long would…

Mithical
- 2,885
- 5
- 27
- 39
27
votes
1 answer
What is the "temperature" in the GPT models?
What does the temperature parameter mean when talking about the GPT models?
I know that a higher temperature value means more randomness, but I want to know how randomness is introduced.
Does temperature mean we add noise to the weights/activations…

Tom Dörr
- 393
- 1
- 3
- 7
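A minimal sketch of how the temperature parameter is commonly applied when sampling from GPT-style models (illustrative NumPy, not OpenAI's actual implementation): the logits are divided by the temperature before the softmax, so the randomness comes from the sampling step rather than from noise added to weights or activations.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token id from logits after temperature scaling.

    Dividing the logits by the temperature before the softmax is the usual
    way randomness is controlled: T < 1 sharpens the distribution, T > 1
    flattens it. No noise is added to weights or activations.
    """
    rng = rng if rng is not None else np.random.default_rng()
    scaled = logits / max(temperature, 1e-8)          # guard against T = 0
    probs = np.exp(scaled - scaled.max())             # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# The same logits become near-deterministic at low temperature.
logits = np.array([2.0, 1.0, 0.5, -1.0])
print(sample_next_token(logits, temperature=0.2))
print(sample_next_token(logits, temperature=1.5))
```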
26
votes
1 answer
What exactly are the "parameters" in GPT-3's 175 billion parameters and how are they chosen/generated?
When I studied neural networks, parameters were things like the learning rate and batch size. But even GPT-3's arXiv paper does not mention what exactly the parameters are; it only gives a small hint that they might just be sentences.
Even tutorial…

Nav
- 481
- 1
- 5
- 10
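A minimal sketch of what "parameters" means in this context, assuming the usual definition of learned weights and biases rather than hyperparameters such as learning rate or batch size (the layer sizes below are illustrative):

```python
import numpy as np

# "Parameters" here means learned weights, not hyperparameters like learning rate.
# A single fully connected layer mapping 768 -> 3072 features already has:
in_features, out_features = 768, 3072
weights = np.zeros((in_features, out_features))   # learned by gradient descent
biases = np.zeros(out_features)                   # also learned
print(weights.size + biases.size)                 # 2,362,368 parameters in one layer
```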
16
votes
2 answers
Why does GPT-2 Exclude the Transformer Encoder?
After looking into transformers, BERT, and GPT-2, from what I understand, GPT-2 essentially uses only the decoder part of the original transformer architecture and uses masked self-attention that can only look at prior tokens.
Why does GPT-2 not…

Athena Wisdom
- 311
- 2
- 5
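A minimal sketch of the masked (causal) self-attention mentioned above, assuming standard single-head scaled dot-product attention; the NumPy code and sizes are illustrative:

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal (lower-triangular) mask.

    Position i can only attend to positions <= i, which is what lets a
    decoder-only model like GPT-2 be trained to predict the next token.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq, seq) attention scores
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -1e9, scores)             # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

seq_len, d_model = 4, 8
x = np.random.randn(seq_len, d_model)
print(causal_self_attention(x, x, x).shape)           # (4, 8)
```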
10
votes
1 answer
How does the (decoder-only) transformer architecture work?
How does the (decoder-only) transformer architecture, used in impressive models such as GPT-4, work?

Robin van Hoorn
- 1,810
- 7
- 32
8
votes
2 answers
Is GPT-4 based on GPT-3 or was it trained from scratch?
To me it looks like GPT-4 is based on GPT-3.
On the other hand, there were rumors that the training of GPT-3 was done with errors, but retraining was impossible due to the cost.

Anixx
- 301
- 8
7
votes
2 answers
What is the difference between the positional encoding techniques of the Transformer and GPT?
I know the original Transformer and GPT (1-3) use two slightly different positional encoding techniques.
More specifically, the GPT papers say the positional encoding is learned. What does that mean? OpenAI's papers don't go into much detail.
How…

Leevo
- 285
- 1
- 9
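A minimal sketch contrasting the two techniques referred to above, assuming the standard formulations: fixed sinusoidal encodings (original Transformer) versus a trainable position-embedding table ("learned", as in GPT). Shapes and values are illustrative:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc

# "Learned" positional encoding: a trainable table of shape (max_len, d_model),
# initialized randomly and updated by gradient descent like any other weight.
max_len, d_model = 1024, 64
learned_positions = np.random.normal(scale=0.02, size=(max_len, d_model))

print(sinusoidal_positions(8, d_model).shape, learned_positions[:8].shape)
```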
7
votes
1 answer
How do we know if GPT-2 is a better language model?
You may have heard of GPT-2, a new language model. It has recently attracted attention from the general public, as the organization that published the paper, OpenAI, ironically refused to share the whole model, fearing dangerous implications. Along the…

Lucas Morin
- 232
- 2
- 11
6
votes
5 answers
How is GPT-4 able to solve math?
How can GPT-4 solve complex calculus and other math problems? I believe these problems require analytical reasoning and the ability to compute numbers. Does it still use an LLM to complete this process, or does it add something on top of it?
Here is the link to the…

desert_ranger
- 586
- 3
- 19
5
votes
2 answers
Where can I find pre-trained language models in English and German?
Where can I find (more) pre-trained language models? I am especially interested in neural network-based models for English and German.
I am aware only of Language Model on One Billion Word Benchmark and TF-LM: TensorFlow-based Language Modeling…

Lutz Büch
- 161
- 7
5
votes
1 answer
How does a GPT-based language model like ChatGPT determine the n-th letter of a word?
I understand that GPT models process input text by converting words into tokens and then into embedding vectors, rather than processing them letter by letter. Given this approach, I am curious to know how a model like ChatGPT can identify the first (or n-th)…

Peyman
- 534
- 3
- 10
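A minimal sketch of why letter-level questions are awkward for token-based models, using a toy greedy longest-match tokenizer (the vocabulary and matching rule are illustrative; real GPT models use learned BPE merges):

```python
# Toy subword vocabulary; real GPT tokenizers (BPE) are learned from data.
vocab = {"straw": 0, "berry": 1, "straws": 2, " ": 3}

def toy_tokenize(text):
    """Greedy longest-match tokenization into subword ids.

    The model only ever sees these integer ids (then their embeddings),
    so individual letters are not directly represented in its input.
    """
    ids = []
    while text:
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(vocab[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError("out-of-vocabulary character: " + text[0])
    return ids

print(toy_tokenize("strawberry"))   # [0, 1] -- two tokens, not ten letters
```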
5
votes
2 answers
How is the next token predicted in transformers?
In the transformer (or GPT/decoder-only models), at the end of the decoder blocks but before the final linear layer, you have X vectors (one for each of the X tokens at the input of the decoder). We then want to compute the probabilities for the next token of the…

Miguel Carvalho
- 51
- 1
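A minimal sketch of the step described above, assuming a standard output head: only the hidden vector at the last position is projected to vocabulary-sized logits and normalized with a softmax (toy NumPy, illustrative sizes):

```python
import numpy as np

# Toy sizes; real models use much larger values.
seq_len, d_model, vocab_size = 5, 16, 100

# Hidden states coming out of the last decoder block: one vector per input token.
hidden = np.random.randn(seq_len, d_model)

# The output head: a linear projection to vocabulary size (often tied to the embedding matrix).
W_out = np.random.randn(d_model, vocab_size) * 0.02

# Only the hidden state at the *last* position is needed to predict the next token.
logits = hidden[-1] @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
next_token_id = int(np.argmax(probs))               # greedy choice; sampling is also common
print(next_token_id, probs.shape)                   # probs has one entry per vocabulary token
```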
5
votes
1 answer
What can GPT-4 do linguistics-wise?
I have no access to GPT-4, but I wonder whether it can do the following (where ChatGPT failed).
Make syntactic and morphological analysis of sentences in a language like Russian, marking cases, parts of speech and sentence, conjugations of verbs,…

Anixx
- 301
- 8
5
votes
1 answer
Is the Mask Needed for Masked Self-Attention During Inference with GPT-2?
My understanding is that masked self-attention is necessary during training of GPT-2, as otherwise it would be able to directly see the correct next output at each iteration. My question is whether the attention mask is necessary, or even possible,…

D_s
- 51
- 3
4
votes
2 answers
What sort of computer would be necessary to run queries on an LLM?
I've heard that to train a model like GPT-4 you need a very powerful computer and ~$10M of computing power, but once you've produced the trained ~570 GB model, what sort of computing power is necessary to execute specific queries with it?

ak0000
- 195
- 1
- 8
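A rough back-of-the-envelope sketch of the memory needed just to hold the weights at inference time, assuming the 175-billion-parameter count from the GPT-3 paper and common storage precisions (illustrative only; activations, the KV cache, and batching add more on top):

```python
# Back-of-the-envelope memory needed just to hold model weights for inference.
# Assumed values: parameter count from the GPT-3 paper, common storage precisions.
params = 175e9                      # GPT-3 has 175 billion parameters
bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gib = params * nbytes / 2**30
    print(f"{precision}: ~{gib:,.0f} GiB of memory for the weights alone")
```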