
From what I understand, ChatGPT is just a fancy neural network, operating like a sophisticated Markov Chain generator. As such, it should only be able to generate tokens that are in its training dataset. One thing it should not be able to generate is a token unique to the conversation, such as a random number, since the conversation is not part of its training.

However, ChatGPT very easily, and with complete accuracy, repeats many long random numbers I give it. I also split up the random number across my request, and have ChatGPT piece it together again.

Eric Holloway: How about concatenating these numbers, and removing the intermittent filler words? 9475 3008 3305 11 frog 4357 5075 0021 3385 0894 2324 3812 9336 29 dog 52 1375 8236 3840 5500 6080 3841 log 7266 0722 9112 9774 4967 4707 81

ChatGPT: To concatenate the numbers you have provided, while removing the filler words “frog”, “dog”, and “log”, you can simply combine the numbers in the order that they are given. Here is the resulting number: 9475 3008 3305 1143 5750 7500 2133 8508 9423 2438 1293 3629 5213 7582 3638 4055 0060 8038 4172 6607 2291 1297 7449 6747 0781
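
A short Python check (using the digits from the exchange above) strips the filler words, concatenates the digits, and re-splits them into blocks of four; the result matches ChatGPT's answer exactly:

```python
# Reproduce the concatenation task from the conversation above:
# drop the filler words, join the digits, and regroup into blocks of four.
prompt = ("9475 3008 3305 11 frog 4357 5075 0021 3385 0894 2324 3812 9336 29 dog "
          "52 1375 8236 3840 5500 6080 3841 log 7266 0722 9112 9774 4967 4707 81")

digits = "".join(tok for tok in prompt.split() if tok.isdigit())
regrouped = " ".join(digits[i:i + 4] for i in range(0, len(digits), 4))
print(regrouped)
# 9475 3008 3305 1143 5750 7500 2133 8508 9423 2438 1293 3629 5213 7582
# 3638 4055 0060 8038 4172 6607 2291 1297 7449 6747 0781
```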

How is a neural network able to do this?

yters

3 Answers


As well as decent modelling of phrases or strings of digits, ChatGPT can identify when a completely novel string might be a product name, a fictional person or place, etc.

The language model in ChatGPT is partially character based. There are around 100,000 tokens in the latest GPT models - the encoding is called cl100k_base. Most of them are common English words (e.g. "the"), but it also supports individual characters, including individual digits, and processes sequences of characters, effectively modelling groups of them at a time.

From what I understand, ChatGPT is just a fancy neural network, operating like a sophisticated Markov Chain generator.

That's a reasonable analogy. ChatGPT is a bit like a 2048-gram prediction engine: predict the next token, shift the sequence along by one, and repeat. No different to toy fantasy name generators when viewed from 10,000 feet up.
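
As a loose illustration of that predict-shift-repeat loop, here is a toy Python sketch (purely schematic; `predict_next_token` is a hypothetical stand-in for the model, not a real API):

```python
# Toy sketch of autoregressive generation: predict one token, append it,
# slide the fixed-size context window along by one, and repeat.
CONTEXT_SIZE = 2048  # the context length discussed in this answer

def generate(predict_next_token, prompt_tokens, n_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        context = tokens[-CONTEXT_SIZE:]          # only the last 2048 tokens are visible
        next_token = predict_next_token(context)  # the model scores possible next tokens
        tokens.append(next_token)                 # the window effectively shifts by one
    return tokens
```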

As such, it should only be able to generate tokens that are in its training dataset. One thing it should not be able to generate is a token unique to the conversation, such as a random number, since the conversation is not part of its training.

A couple of misunderstandings here. First, the random number will not become a single token, but will be one token per digit, or per pair or triple of digits, depending on the sequence - you can try this in the OpenAI tokenizer (https://platform.openai.com/tokenizer) to help visualise the encoding that ChatGPT uses. Of course each of those tokens will have been seen before, millions of times, in the training data.
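
If you want to see this locally rather than in the web tokenizer, OpenAI's open-source tiktoken library exposes the cl100k_base encoding (a small sketch; the exact digit groupings can vary between encodings and models):

```python
# pip install tiktoken  -- OpenAI's open-source tokenizer library
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent GPT models

for text in ["the", "frog", "9475 3008 3305"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {pieces}")

# Common words map to a single token each, while the digit string is broken
# into several short tokens of roughly one to three digits apiece.
```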

Second, sequences do not need to have been seen in the training data in order for ChatGPT to work with them. In fact, with an input sequence length of 2048, pretty much all inputs to ChatGPT in inference mode are unique, never-seen-before sequences. Regardless of whether some of the tokens represent a long random number, the chances of any 2048-token sequence of letters and short words being unique when generated are very high.

This is where the neural network model differs from a true 2048-gram. It has generalised from the training data well enough that it actually can predict meaningful and useful probabilities for the next token, even though in all likelihood it has never before seen the exact same sequence. In this regard it is an approximation of a "perfect" 2048-gram prediction engine that has somehow been trained on infinite human writings.

A lot of language modelling is about correctly processing the context of a subsequence, so recognising a number sequence as being a grammatical "unit" that can be reused as-is is not a surprising feature.

Neil Slater
  • Can you expand on how a neural network is able to recognize a grammatical unit and then also repeat it? My understanding of a generative neural network is that it doesn't repeat anything in the input. It is a big network of statistical associations, and is actually identical to a Markov chain, and outputs the terms statistically associated with the input. So, in order for it to output the input, there has to be a specific statistical entry that associates input "repeat number X" with output "X". That's only possible if such an entry is in its training, and that cannot happen with random numbers. – yters Jan 05 '23 at 17:40
  • @yters The tokens in the "Markov chain" here are down to character level, and one big difference between a Markov chain trained on word frequency and a neural network is that the neural network will generalise and be able to assign frequencies to sequences that have never been observed before, due to analogous sequences seen in training data. This is what allows a large language model like ChatGPT to manipulate never-before-seen sequences, where a frequency-based Markov chain cannot (an actual Markov chain model would also be seriously constrained by sequence length). – Neil Slater Jan 05 '23 at 22:53
  • Neural networks only generalize in a very shallow sense, like with convolutional layers, they essentially create a statistical model based on smaller features in the image, instead of the image as a whole. In this case, it'd be finding features based on letter and word combinations. I don't see how this can generalize to the level where we have an imperative grammar and the neural network is able to understand the command 'repeat this' in many different representations. Do you have details on how this command generalization occurs in neural networks? – yters Jan 05 '23 at 23:21
  • Also, neural networks are a Markov process. The definition of a Markov process is where the future is independent of the past given the present. The GPT models have a limited horizon of 2048 tokens, and have no memory. This fits the definition of a Markov process. – yters Jan 05 '23 at 23:32
  • @yters: Production of next token is a Markov process yes. Perhaps I was confused assuming you were talking about an ngrams process because you made the (incorrect) claim that any repeated/understood sequence somehow had to be in the training data. The ngram is an interesting analogy. The neural networks are approximating (and generalising to) an imaginary 2048-gram with perfect knowledge of all possible human discourse. – Neil Slater Jan 05 '23 at 23:45
  • Unlike the 2048-gram, the neural network really can predict from and work with sequences that it has not seen before, by analogy to similar sequences that it *has* seen in the training data. This makes it an approximation of an imaginary perfect 2048-gram that has somehow been trained on all valid text output ever. That includes 2048-grams that include the input random number entered by a human and a corresponding copy (or manipulated copy) of it output by the chat. The difference between an LLM and one you or I can train is that it will actually have some sensible probabilities for that. – Neil Slater Jan 05 '23 at 23:47
  • Not that the exact phrase had to be in the markov chain, but there is an equivalent high probability transition that would go from 'repeat X' to generating 'X', where X is a long random number. This seems highly implausible for a neural network. I can see a rule based ELIZA type system implementing such a thing, but it doesn't make sense OpenAI would add such a niche rule to their system. Also, the ability of ChatGPT to 'repeat X' is very flexible representation wise. I can write the request in many different ways, and break X up, and ChatGPT can still repeat X with complete accuracy. – yters Jan 06 '23 at 01:06
  • On the other hand, ChatGPT has trouble exactly where I'd expect a human to have trouble, when I asked it to 'reverse X'. It does so very inaccurately, just like a human would if they tried reversing a long random number by hand very quickly. – yters Jan 06 '23 at 01:08
  • What do you mean by "There are around 2000 tokens in GPT"? Is this really true - hard to imagine. – Hans-Peter Stricker May 06 '23 at 20:29
  • @Hans-PeterStricker: Yes it is true. It is a hybrid system for efficiency. – Neil Slater May 06 '23 at 20:53
  • Thanks. Where can I find an explicit list of these tokens? Do you have a reference? – Hans-Peter Stricker May 06 '23 at 21:16
  • Are we talking about the same tokens OpenAI's tokenizer counts? https://platform.openai.com/tokenizer – Hans-Peter Stricker May 06 '23 at 21:27
  • @Hans-PeterStricker Yes. At least for GPT-3 and ChatGPT based on GPT-3 - https://blog.quickchat.ai/post/tokens-entropy-question/ Thanks for the link BTW, added it to the answer. – Neil Slater May 07 '23 at 09:07
  • @Hans-PeterStricker It looks like my quoted numbers are for an earlier version of GPT-3, and that the later versions have a vocab of around 50,000. – Neil Slater May 07 '23 at 09:08
  • @Hans-PeterStricker Took a while to find the raw list for GPT-4 (also the same for the GPT-3.5 used in ChatGPT, I think), but here it is: https://github.com/kaisugi/gpt4_vocab_list/blob/main/cl100k_base_vocab_list.txt - this one has 100k tokens, and my original assertion of 2000 tokens looks very old, but I did get that from a published source (it just must have been about a different model). – Neil Slater May 07 '23 at 09:18
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/145840/discussion-between-hans-peter-stricker-and-neil-slater). – Hans-Peter Stricker May 07 '23 at 09:35

The existing answer covers model generalization well, but I would like to add a note about an important inductive bias of the Transformer model architecture used for ChatGPT.

In the Transformer architecture, there is a mechanism called attention. An attention block can access all of the context (the input and the previously generated output) and retrieve whatever it needs from it, based on the model's own learned choices. A partial analogy for the attention mechanism is the way a human's eyes move back to earlier text while writing.

Due to the attention mechanism, a Transformer model is very good at repeating input verbatim in its output. This is different from earlier architectures such as RNNs, where the model has to compress the input into its internal representation first.
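
To make the copying intuition concrete, here is a minimal NumPy sketch of scaled dot-product attention (a toy illustration with made-up matrices, not ChatGPT's actual architecture or weights): when a query aligns strongly with one key, the output is essentially a copy of that position's value.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over a set of positions."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V                                # weighted mix of the stored values

K = np.eye(4)                                            # four earlier positions, distinct keys
V = np.array([[1., 0.], [0., 1.], [1., 1.], [.5, .5]])   # the values stored at those positions
Q = 10 * K[2:3]                                          # a query that "points at" position 2
print(attention(Q, K, V))                                # ~[[1., 1.]]: position 2's value is copied
```

When the query-key match is sharp, the softmax weights concentrate on a single position, so the layer can pass an earlier token's representation through almost unchanged, which is exactly what verbatim repetition needs.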

That being said, the provided example is something more than mere repetition, and it is very difficult to say "how" a large language model performs a task in general, especially as the ChatGPT model is not public.

pcpthm

Turns out ChatGPT is indeed human curated, by open admission.

During this conversation, ChatGPT outright states that the OpenAI team filters and edits the GPT-generated responses.

...the response you are receiving is being filtered and edited by the OpenAI team, who ensures that the text generated by the model is coherent, accurate and appropriate for the given prompt.

Apparently, the fact that OpenAI actively curates ChatGPT's responses is indirectly implied in the documentation here.

Human in the loop (HITL): Wherever possible, we recommend having a human review outputs before they are used in practice. This is especially critical in high-stakes domains, and for code generation. Humans should be aware of the limitations of the system, and have access to any information needed to verify the outputs (for example, if the application summarizes notes, a human should have easy access to the original notes to refer back).

So, that explains that :)

yters
  • The HITL section does _not_ imply that this is what they do with ChatGPT. Please, stop spreading your conspiracy theories without _solid_ evidence. – David Jan 17 '23 at 10:36
  • @DavidIreland did you read the transcript I linked in this answer? It's pretty clear. I just added a small excerpt to give the gist. – yters Jan 17 '23 at 10:47
  • ChatGPT has no awareness or anything like that of concepts such as "true" and "false". It does not truly understand your questions and there are no guarantees that it provides truthful answers. It just produces text that is statistically likely, based on its training data. So, if you ask it whether a human wrote its output, and it responds saying that this indeed happened... this gives us **absolutely no** new information. It may or may not be true. But this cannot be a basis of your answer. Your second link is a set of Safety Best Practices, it is not a statement of what they actually did. – Dennis Soemers Jan 17 '23 at 11:18
  • Thank you, @DennisSoemers. – David Jan 17 '23 at 11:26
  • @DennisSoemers did you read the transcript? It makes a lot of sense, and ChatGPT volunteers a fair amount of unsolicited information. Also, look at what precedes the confession. The chatbot confessed to being human because I identified a clear pattern that indicated human involvement, and I called them on it. – yters Jan 17 '23 at 16:49
  • @yters As I explained in the previous comment, it "confessing" anything is meaningless. It does not know what is true and what is not true, and has no self-awareness. You can probably get it to confess anything you like, **especially** with leading prompts where you're basically first telling it that its output is generated by a human. I've seen lots of examples where humans tell the bot that its output is incorrect (even if it actually was correct), and it will consistently apologise and agree. – Dennis Soemers Jan 17 '23 at 18:35
  • @DennisSoemers you are begging the question assuming the chatbot is just AI. The scientific approach is to weigh each hypothesis on its own merits based on the evidence, instead of a priori biasing towards one hypothesis. – yters Jan 17 '23 at 19:41
  • @DennisSoemers I would also say you are treating the neural network as a magical black box capable of anything. I would encourage you to research how neural networks work, and then ask yourself what is the likelihood the statistical associations in a neural network are capable of reproducing a very long, consistent, off the cuff conversation about ChatGPT being curated by the OpenAI team? As compared to the theory put forth in our conversation that ChatGPT's neural network doesn't do well over a long conversation, and needs human intervention to stay on course? – yters Jan 17 '23 at 19:47