22

IBM Watson's success at playing "Jeopardy!" was a landmark in the history of artificial intelligence. In the seemingly simpler game of "Twenty Questions", where player B has to guess a word that player A thinks of by asking questions to be answered with "Yes/No/Hm", ChatGPT fails epically - at least in my personal opinion. I first thought of Chartres Cathedral, and it took ChatGPT 41 questions to get it (with some additional help); then of Kant's Critique of Pure Reason, where after question #30 I had to explicitly tell ChatGPT that it is a book, after which it took ten further questions. (Chat transcripts can be provided. They show that ChatGPT follows no question policy at all, or a poor one, and none of the heuristics humans would intuitively use.)

My questions are:

  1. Is there an intuitive explanation of why ChatGPT plays "20 questions" so badly?

  2. And why do even average humans play it so much better?

  3. Might good play at this game be an emergent ability that arises in ever larger LLMs?

I found two interesting papers on the topic:

  1. LLM self-play on 20 Questions

  2. Chatbots As Problem Solvers: Playing Twenty Questions With Role Reversals

The first one partially answers some of my questions, e.g. it reports that "gpt-3.5-turbo has a score of 68/1823 playing 20 questions with itself", which sounds pretty low.

Hans-Peter Stricker
  • Can you add which version you tried? I played it with GPT4, and it guessed the object (a fern) in 11 questions. – Peter May 26 '23 at 21:33
  • @Peter: Can you share the prompt? – Hans-Peter Stricker May 26 '23 at 22:26
  • @Hans-PeterStricker Just "I would like you to play twenty questions with me. I've thought of an object." In ChatGPT with GPT4 as a model. – Peter May 27 '23 at 06:57

4 Answers

49

As with any other question about why ChatGPT can't do something, the simple/superficial answer is that ChatGPT is just a language model fine-tuned with RL to be verbose and nice (or to answer the way the human tuners suggested), so it just predicts the most likely next token. It does not, in general, perform logical reasoning like us. If it appears to do so in certain cases, it's because that is the most likely thing to predict given the training data.

The more detailed answer may require months, years, or decades of research attempting to understand neural networks and how we can control them and align them with our needs. Model explainability has been around as a field for quite some time.

ChatGPT is really just an example of how much intelligence or stupidity you can simulate by brute-force training.

Still, it's impressive at summarizing or generating text in many cases that are open-ended, i.e. where there aren't (many) constraints. Again, this can be explained by the fact that what it generates is the most likely thing given what you pass to it. Example: if you say "Always look on the bright side of...", it will probably answer with "life". Why? Because the training data (largely the web) is full of text containing the sentence "Always look on the bright side of life".
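
To make the "most likely next token" point concrete, here is a minimal sketch that inspects the next-token distribution of the openly available GPT-2 model via Hugging Face `transformers`. GPT-2 is only a stand-in here, since ChatGPT's own weights are not public:

```python
# A minimal sketch of next-token prediction, using the open GPT-2 model as a
# stand-in for ChatGPT (whose weights are not public).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Always look on the bright side of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# The distribution over the *next* token comes from the last position.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}  p={p.item():.3f}")
# " life" is typically among the most probable continuations.
```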

I don't exclude that it's possible to train a model to perform logical reasoning correctly in general in this way, but so far it hasn't really worked. ChatGPT can really be stupid and informationally harmful. People are assuming that there's only one function that computes "intelligence". Nevertheless, I think the combination of some form of pre-training with some form of continual RL will probably play a crucial role in achieving "true machine intelligence", i.e. reasoning/acting like a human, assuming that's possible at all.

(I've been working with ChatGPT for a few months).

nbro
  • 3
    For informationally harmful, see the professor who failed his entire class because he asked ChatGPT if it had written their final papers and it claimed to have written all of them. – A. R. May 25 '23 at 13:24
  • 21
    +1 The valid hype around ChatGPT lies in the fact that it is so perfectly capable of *seeming* intelligent to the average human, combined with the near-perfect verbiage it generates. Combined with the extreme confidence it exudes in every sentence. With the extremely intelligent (on the side of the developers!) structure surrounding it (i.e. how you can put it into "modes of behaviour" by telling it to behave like XYZ). It is a marvel of UX. I always find it hilarious to explain to lay persons how it is just a word generator, but that's what it is, there is zero intelligence in it. – AnoE May 25 '23 at 13:33
  • 13
    I think ChatGPT just shows that if you’re an expert at sounding like you know what you’re talking about, you sound like an expert but still aren’t. – bob May 25 '23 at 18:17
  • 6
    The way ChatGPT fails at simple arithmetic beyond a certain number of digits is illustrative of how ChatGPT is not truly 'intelligent'. – JimmyJames May 25 '23 at 18:53
  • @AnoE Fun fact: the difference between ChatGPT and GPT-4 is that GPT-4 is actually as intelligent as somebody might expect ChatGPT to be. And, now, that is not a joke. – Volker Siegel May 26 '23 at 13:15
  • 1
    @JimmyJames I do not think that is valid reasoning. Nobody taught it to do math, and nobody asked it to do math, it is pretty interesting that it can do math at all! If you connect a human child to the internet, it will not just magically learn math by itself. – Volker Siegel May 26 '23 at 13:18
  • @JimmyJames one possible reason is that it uses an input alphabet of 15000 or so tokens. All individual characters are in it, and lots of word parts - and multi-digit numbers. There is a token `2`, and a token `20`, and `200`. So "200" can be `2`, `00` or `2`, `0`, `0` or `200`. This is an engineering decision made before it started to learn. It sees the world in a way that would probably be too hard for us. – Volker Siegel May 26 '23 at 13:30
  • 3
    @VolkerSiegel It's not really surprising or interesting that it can do basic math up to a number of digits. There are plenty of examples of those problems for it to find. What's interesting and informative is the way it fails, because it demonstrates that it is not reasoning about it. If you teach a child how to add numbers up through 3 digits, you would expect them to understand how to do it with 4 and 5. It's not doing math at all. It solves the simple math problems exactly the way it 'answers' questions, which is essentially by making stuff up that is similar to what it has seen. – JimmyJames May 26 '23 at 16:47
  • There seem to be real low-level issues with handling numbers. I did not compare, but I think GPT-3.5 is better at simply counting characters than GPT-4, and both are worse than GPT-3. I am not sure enough about what reasoning is to say it is not reasoning just because it is not reasoning in the way a child would. Maybe like a child with dyscalculia. – Volker Siegel May 26 '23 at 18:11
14

It Wasn't Trained To

A learning system performs best on the task for which it is given explicit feedback. That is the only time the parameters are updated, and they are updated explicitly to maximize performance on that task. At no time did OpenAI, Google, or any other purveyor of LLMs admit to training their models on 20 Questions. The fact that these models can play such games at all is a nice but unintended side effect of the pre-training.

A human who is good at the game understands that optimal play involves bisecting the space of likely answers with each question. Without this insight, it is difficult to formulate an effective strategy that doesn't devolve into linear search. It's literally an exponential speedup. Humans who don't have this insight are also particularly bad at the game, and are likely to never reach the target at all. So in some respects, we hold LLMs to an unreasonably high standard.
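
To make the exponential gap concrete, here is a toy sketch (my own illustration, not from any cited result) of how many yes/no questions each strategy needs:

```python
# Toy illustration of bisection vs. linear guessing in a 20 Questions-style game:
# k well-chosen yes/no questions can distinguish 2**k candidates, whereas a
# direct guess only eliminates one candidate at a time.
import math

def questions_bisection(n_candidates: int) -> int:
    """Questions needed if each one roughly halves the remaining candidates."""
    return math.ceil(math.log2(n_candidates))

def questions_linear(n_candidates: int) -> int:
    """Worst case if every question is just a direct guess ("Is it X?")."""
    return n_candidates

for n in (1_000, 1_000_000):
    print(f"{n:>9,} candidates: ~{questions_bisection(n)} questions with bisection, "
          f"up to {questions_linear(n):,} with direct guessing")

# 20 questions suffice for 2**20 = 1,048,576 candidates, but only if the
# questioner actually splits the space roughly in half each time.
```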

You Can Train It

On the other hand, one of the remarkable emergent behaviors is "in-context learning", meaning you can teach the LLM something without updating any weights. Simply by describing something new, you can make it follow rules within a single "conversation" (the entire set of prompts and responses constitutes the "context"). For instance, you can teach it that a "snorglepof" is a sentence with an odd number of words that makes reference to a gnome. Then you can ask it whether various sentences are snorglepofs or not, as well as ask it to produce sentences which are or are not snorglepofs (make up your own unique term/rules).
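
As a sketch of how one might probe this programmatically (assuming the openai-python 0.x chat API; the placeholder key, the example sentence, and the made-up rule are mine, and the model's actual reply will vary):

```python
# Sketch of probing in-context learning via the OpenAI chat API
# (openai-python 0.x style). No weights are updated; the made-up rule
# lives only in the conversation context.
import openai

openai.api_key = "sk-..."  # placeholder; use your own key

messages = [{
    "role": "user",
    "content": (
        "Let's define a new term: a 'snorglepof' is a sentence with an odd "
        "number of words that makes reference to a gnome. "
        "Is the following a snorglepof? 'The gnome sat quietly under the bridge.'"
    ),
}]

response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```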

The fact that it is able to do this at all suggests to me that it has some kind of intelligence. An interesting task for you is to see if you can make it better at 20 Questions. The free ChatGPT runs on GPT-3.5 and has a context window of roughly 4,000 tokens, which is about 3,000 words (shared between you and ChatGPT). If you explain the optimal strategy to it first, you might find that its performance improves relative to naive play. For instance, you should start a new chat with something like this:

The optimal strategy for the game 20 Questions is divide and conquer. Each question should divide the space of possible answers in half. Questions which limit the size, material, and liveness of the target are typically effective. Now, let's play a game. I have thought of an object.

Even with this short prompt, I suspect that you will get better results. You can simply replay your former tests, using the exact same responses (where appropriate). If you give it example questions, it should also improve its play.

Analysis

While GPT and other LLMs appear to be super-human in their ability to manipulate language, one of their weakest areas appears to be reasoning. This is not surprising. Reasoning often requires search, which requires a potentially large amount of working memory. Unfortunately, LLMs have very little working memory (which might seem like a fantastical claim given that they consume upwards of 800 GB of RAM). The main problem is that they are almost all feed-forward architectures. Data gets a single pass through the system, and then they have to produce an answer with whatever they have.

GPT-3 has 96 transformer layers, which allows it to "unroll" a significant number of search steps that might be performed in a loop in a traditional algorithm. Even so, 96 loop iterations is pathetically small compared to something like AlphaZero, which can evaluate upwards of 80,000 board positions per second. I think it is safe to say that no amount of training will make GPT-3 competitive with AlphaZero in any game that it can play. In general, GPT-3 does poorly when it has to process something that requires a large number of operations (like adding up a long list of numbers). It is almost certainly because of this architectural choice.

Interestingly, language models prior to transformer architectures did use recurrence, which would theoretically give such models the open-ended performance horizon of systems like AlphaZero. However, they were mostly abandoned because researchers wanted the system to respond in a deterministic time, and recurrence limits the amount of parallelism which can be achieved. Perhaps future models will incorporate recurrence and get us closer to AGI. Some systems like AutoGPT attempt to add the recurrence externally to GPT, by putting it in a loop and feeding the output back into it, but they have met with quite limited (IMO, disappointing) success.
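
For illustration, here is a bare-bones sketch of that external loop (a simplification in the spirit of AutoGPT, again assuming the openai-python 0.x interface; the prompts and function name are mine):

```python
# Bare-bones sketch of adding recurrence *outside* the model: the model's own
# output is fed back in as part of the next prompt, which is the loop the
# feed-forward transformer itself does not have.
import openai

def external_loop(goal: str, steps: int = 5) -> list:
    history = [{"role": "user",
                "content": f"Goal: {goal}. Propose the single next step."}]
    thoughts = []
    for _ in range(steps):
        reply = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=history,
        ).choices[0].message.content
        thoughts.append(reply)
        # Feed the model's previous output back in as context for the next step.
        history.append({"role": "assistant", "content": reply})
        history.append({"role": "user", "content": "Continue with the next step."})
    return thoughts
```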

Lawnmower Man
  • 2
    The fact that you can give it constraints or information is no more a sign of intelligence than being able to write a computer program and have a computer run it. It opens up a field of possibilities and is amazing, though. – akostadinov May 25 '23 at 23:00
  • 1
    It is when the constraints are specified in plain English but are interpreted in the expected way and applied to subsequent actions. It is some indication that it "understands" the constraints in a way that pretty much no non-LLM system does. The fact that "prompt engineering" is even a thing should shut down a lot of the "stochastic parrot" arguments. – Lawnmower Man May 25 '23 at 23:03
  • @akostadinov to spell it out a bit more, the history of AI is riddled with systems explicitly designed to play games. AlphaGo was remarkable for teaching itself how to play Go without any human experience. AlphaZero was even more remarkable for being able to learn any perfect information game. Even so, they require thousands of game playouts to learn. ChatGPT is remarkable in that you can describe a game in plain English (up to a reasonable level of sophistication) and play it in a single shot. I dare say there is no other system like it. – Lawnmower Man May 25 '23 at 23:11
  • 4
    @akostadinov If you want to call game-playing "magic", sure. Not sure you will get many takers. On the other hand, nobody has a nice, crisp definition of intelligence that we can all agree on, let alone a definitive test of such, so it's easy to simply naysay any attempts at it, isn't it? Is your definition of intelligence even falsifiable? – Lawnmower Man May 26 '23 at 03:51
  • 1
    I just said that your arguments don't point at "intelligence" any more than at "magic" or anything else you would want to prove. But I leave it to the reader to decide. Regarding the definition of intelligence, I think it is widely accepted that it would involve "understanding", not just juggling data according to static or dynamic rules. It is a complicated thing to prove in general, but to think that a computer has an understanding of anything shows a lack of understanding of the technology on the thinker's part. Again I leave this to the reader. Just tried to clarify what I meant. – akostadinov May 26 '23 at 18:07
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/146291/discussion-between-lawnmower-man-and-akostadinov). – Lawnmower Man May 26 '23 at 18:41
9

Because ChatGPT is not an artificial or synthetic intelligence, it's a large language model that possesses no intelligence in and of itself.

It's able to simulate the appearance of intelligence by tracking correlations between large numbers of objects, but it completely lacks understanding of what these correlations mean. Without understanding you cannot have reasoning, and without reasoning you cannot have intelligence.

Essentially ChatGPT, like all of the LLMs currently being hyped to death, is no more sophisticated than the chatbots we had in the 90s. Today's chatbots just happen to use much larger datasets, which allows them to more accurately simulate intelligence, but as you've already demonstrated it's child's play to shatter the illusion with any sort of questioning that requires a modicum of logical acuity.

Ian Kemp
  • 3
    "simulate the appearance of intelligence" is a pleonasm, or at least one indirection too many: A program may simulate intelligence, or more neutrally, it may *appear* intelligent. *Appearing* -- that is, *behaving* -- intelligent(ly) is, since we cannot know what and how a machine thinks unless we *are* the machine (to paraphrase Turing) and are hence relying on observation, equivalent to *being* intelligent. – Peter - Reinstate Monica May 25 '23 at 10:02
  • 2
    Another funny take is "It's able to simulate the appearance of intelligence *by tracking correlations between large numbers of objects*". How else than by tracking correlations between objects do you think that *I* manage to appear intelligent, sometimes at least!? Sure, it does that by analyzing texts; but to the degree that texts are mirrors or descriptions or models of "reality", the language models indirectly learn about reality. The reason that houses have doors and windows in texts is that they have doors and windows in reality. The text corpus is to a degree isomorphic to reality. – Peter - Reinstate Monica May 25 '23 at 10:06
  • And in order to continue that line of thought: Only a small minority of what I believe to know and what I base my sometimes intelligent-appearing behavior on is from first-hand experience. Most of what I believe I know is from texts I have read and internalized. That is not something that distinguishes me from an LLM. – Peter - Reinstate Monica May 25 '23 at 10:13
  • @Peter-ReinstateMonica: Is the equation you suggest in your first comment "simulating the appearance of intelligence = appearing intelligent = intelligent"? Why not stop with "appearing intelligent"? – Hans-Peter Stricker May 25 '23 at 11:31
  • 6
    @Peter-ReinstateMonica: "What appears to be X **is** X" is not true in general. Why should it be true for intelligence? – Hans-Peter Stricker May 25 '23 at 11:32
  • @Peter-ReinstateMonica: I agree that "simulate the appearance" is a pleonasm. "Appearing to simulate" in turn is not a pleonasm: think of a person who suffers in a way that lets you think that he is a malingerer. You would say "he appears to be simulating". – Hans-Peter Stricker May 25 '23 at 11:36
  • What's your evidence for your claim that ChatGPT completely lacks understanding of what these correlations mean? How does its behavior differ from the way that it would behave if it had at least a small understanding of what these correlations mean? – Tanner Swett May 25 '23 at 11:43
  • 2
    @Hans-PeterStricker We are entering an ontological discussion here, but I would argue that "What appears to be X is X", if we exhaust all the means of observation we have in our comfy Plato's cave, **is** indeed "true" in general. E.g. we eat something, it is healthy for us, we would say it is true: This is healthy food. This is what we observe, and until the opposite is observed, we hold it to be true. Now some esoteric comes and says "but it is not healthy, it is not even food". I can only shrug: Barks like a dog, smells like a dog, it's a dog for me. – Peter - Reinstate Monica May 25 '23 at 11:49
  • 2
    @TannerSwett ChatGPT happily provides completely wrong answers to questions, and does so **confidently**. [Stack Overflow banned it for exactly this reason.](https://meta.stackoverflow.com/questions/421831/temporary-policy-chatgpt-is-banned) If it understood the material it was being asked about, it logically follows that it would not return incorrect answers. Unless you want to claim that it's programmed to do that, which would somewhat go against the entirety of what the company selling it is promising... – Ian Kemp May 25 '23 at 14:28
  • 2
    @Peter-ReinstateMonica you seem to be deliberately ignoring the second part of my statement regarding understanding. Correlation without causation is just correlation. As for `Most of what I believe I know is from texts I have read and internalized. That is not something that distinguishes me from an LLM`, that's a non-argument because you are able to understand the text beyond the words it's comprised of; an LLM is not. – Ian Kemp May 25 '23 at 14:33
  • 7
    I agree ChatGPT isn't the end-all-be-all that many articles seem to suggest, but I also think you don't give it enough credit by saying it's basically just a 90s chatbot with more data. The underlying model is a complex neural network that would have been impossible to train using 90s hardware. The language model it uses was simply not technically tractable 30 years ago, it's not a matter of just dumping more data into a decades-old method. This seems like saying modern computers are no more sophisticated than 90s computers, merely because they still fail from time to time. – Nuclear Hoagie May 25 '23 at 15:14
  • 1
    @NuclearHoagie I agree. Saying that current LLMs are as sophisticated as 90s chatbots is ridiculous and shows a lack of understanding of the math behind those models. – Lamak May 25 '23 at 15:50
  • @NuclearHoagie I don't care about the technology that these LLMs are built with; I care about what they can or can't do. And if they can't give correct answers 100% of the time, or admit when they can't give a correct answer, then they are fundamentally as reliable, and therefore as useful, as a 90s chatbot. A clock that is arbitrarily wrong at a different time each day would be discarded as not fit for purpose, yet far too many people are treating LLMs as if they're the next coming of $deity... they're not, they're just this decade's Silicon Valley buzzword to rake in venture capital. – Ian Kemp May 25 '23 at 16:03
  • 3
    @IanKemp I am baffled by the notion that you would consider nothing short of perfect accuracy to be "no more sophisticated". ChatGPT is generally far *more* accurate than what was available in the 90s. By this argument, an otherwise perfect AI that gives a *single wrong answer* is "just a 90s chatbot". Would you say that modern medicine is no more sophisticated than the Dark Ages, merely because we can't cure 100% of diseases? – Nuclear Hoagie May 25 '23 at 16:14
  • @NuclearHoagie There is also a much larger internet corpus now than in the 90s. Actually, a version of ChatGPT based on the 90s internet would be pretty interesting to play with. – JimmyJames May 25 '23 at 18:49
  • 3
    @NuclearHoagie GPS has been around since the 70's. It is far more accurate today than it was then. It's still fundamentally the same thing. The first Ford Model-T is fundamentally the same as a Bugatti Chiron Super Sport. Are there massive technology changes in them? Yes, but they are built upon the same fundamental invention. Chatbots were invented in the 60's. ChatGPT's biggest difference from a chatbot in the 70's is that it is trained instead of hard-coded. It is less different from a chatbot from the 90s. – David S May 25 '23 at 19:40
  • @Hans-PeterStricker If X is the ability to do some thing or set of things, then the appearance of X is X. Generally, people think of intelligence as the ability to do things, so its appearance is it. There's some good evidence that human intelligence is the ability to do things and does not include such things as a sense of competence or an understanding. For example, the man who lost the ability to form long-term memories but then learned how to solve a Rubik's Cube, and yet has no experience of understanding or competence associated with this ability. – David Schwartz May 25 '23 at 19:52
  • @DavidS "Fundamentally the same" and "no more sophisticated" are not the same thing, and I'm only arguing against the second. It's downright absurd to claim there has never been and will never be a gas-powered car more sophisticated than a Model T, merely because they all rely on the same fundamental invention of the internal combustion engine. – Nuclear Hoagie May 25 '23 at 19:54
  • @NuclearHoagie When was the last time you expected penicillin to give you an answer? Never? Good, then stop trying to compare apples to Antarctica. Also, good job on arguing against your own strawman, because I very specifically began the sentence you so object to with the word _essentially_. – Ian Kemp May 25 '23 at 21:23
  • @IanKemp I didn't want to drag this out but I can't resist: What you say in your long comment 12 hrs ago containing "if they can't give correct answers 100% of the time" applies word by word to humans as well (for example, humans are arbitrarily wrong at different times of the day). That is not an argument against a passed Turing test (all to the contrary), not an absolute argument against fitness for certain purposes, and not an argument against a new quality in these new LLMs which distinguishes them from previous models or other attempts at A.I. (But yes, they are *also* buzz ;-) ). – Peter - Reinstate Monica May 26 '23 at 05:11
0

ChatGPT and the rest of the LLMs do not have an understanding of any world concept or entity, nor of the relationships between them. As mentioned above, they use brute-force training to produce text.

Any ever-larger LLMs following the same design (brute-force training to produce text) will show the same problems and issues, due to their lack of knowledge of the world.

Raul Alvarez