27

In contrast to How does ChatGPT know math?, I've been seeing things floating around the Twitterverse about how ChatGPT can actually be very bad at math. For instance, I asked it: "If it takes 5 machines 5 minutes to make 5 devices, how long would it take 100 machines to make 100 devices?"

The conversation went like this:

Mithical: If it takes 5 machines 5 minutes to make 5 devices, how long would it take 100 machines to make 100 devices?
ChatGPT: If it takes 5 machines 5 minutes to make 5 devices, then it would take 100 machines 100 minutes to make 100 devices.
Mithical: That's not right. Check carefully and try again.
ChatGPT: I apologize if my response was not correct. Can you please provide more information or clarify your question? I may have misunderstood the scenario you described.
Mithical: It takes a machine 5 minutes to make a device. How long would it take 100 machines to make 100 devices?
ChatGPT: If it takes a single machine 5 minutes to make a single device, then it would take 100 machines 5 minutes to make 100 devices. Therefore, it would take a total of 500 minutes for 100 machines to make 100 devices.

Now, unless I'm misunderstanding something here, the answer is 5 minutes. ChatGPT first makes the intuitive mistake of answering 100 minutes, which a human might make as well, and then goes on to (correctly, as far as I understand) say it's 5 minutes... but concludes in the same response that it's 500 minutes.
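
For reference, here is the arithmetic as I understand it, spelled out as a tiny Python sketch (assuming every machine works independently and in parallel):

```python
# Each machine takes 5 minutes to make one device, and machines work in parallel.
minutes_per_device = 5               # one machine -> one device in 5 minutes
machines, devices = 100, 100
rounds = devices / machines          # each machine handles its own device -> 1 round
print(rounds * minutes_per_device)   # 5.0 minutes
```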

Why is this AI so bad at math? Computers are generally supposed to be good at math. Why does this model make such simple logical mistakes?

nbro
Mithical
  • Computers are good at maths because the rules of mathematics are hard-coded into them. What you're asking here is why a learned model is bad at maths -- this is an apples-to-oranges comparison. Just because the model _runs_ on a computer does not mean that it is comparable to the computers you refer to which are good at maths (presumably things like programming languages where, as I say, maths is hard-coded into them) – David Dec 09 '22 at 09:49
  • Computers' ALUs (arithmetic logic units) are extremely good and fast at arithmetic operations. AI models are not ALUs. They are good at guessing, but they cannot perform standard computer algorithms, and are not guaranteed to give correct results. – LatinSuD Dec 09 '22 at 15:09
  • This is generally wrong. Computers do not do math. They do calculations and execute algorithms. Math is making theoretical abstract statements and proving them. Logic and automatic proof systems do a part of this if presented with strictly encoded input, but again this is the computer following an algorithm; there is no innate understanding. – Lutz Lehmann Dec 17 '22 at 10:15
  • What is asked about here is a third task, word problems. Take input in natural language, understand the described situation and build a model for it, then apply the model to a slightly different situation. The bot fails at the third step, as it was only constructed for the first two steps, for any convenient interpretation of "understand". – Lutz Lehmann Dec 17 '22 at 10:20
  • I found that it can find the sine of some irrational number with very good precision, up to 10 digits and more if requested, but it cannot correctly multiply two 4-digit numbers... This is weird. – Anixx Dec 26 '22 at 14:32
  • I tried simple math with it. It is bad: just ten numbers with two digits before and two after the decimal point, like 94.93. I asked it to add all ten and it was off by one hundred. – TE55 Jan 01 '23 at 22:10
  • @LatinSuD The issue isn't that AI models aren't ALUs; this is irrelevant, IMO. A human isn't an ALU either, and a human can solve this problem easily, while an ALU has no clue how to solve this problem. The issue is that the AI model cannot internalize the problem in a way that makes it easily solvable, for example by translating it to a system of equations. – HelloGoodbye Jan 05 '23 at 20:01
  • Same thing. I asked 127+46+96+50+50+85+46+82+68+153+60+92+18=? In the first chat it said 847, and I told it the answer is 973. It learned the answer. Then I created another chat, and now 127+46+96+50+50+85+46+82+68+153+60+92+18 = 861... – ShadowAI Jan 10 '23 at 00:53
  • I have upvoted your comment @LutzLehmann (as it is true and insightful), but the word 'math' is ambiguous. What mathematicians call 'math' is what you say, but people generally label 'calculations' as 'math' too. See other comments. For people wondering what GPT-3 is and isn't good at: https://arxiv.org/pdf/2005.14165.pdf (i.e. very bad at 'simple math', but also other things); this is a paper by OpenAI itself. – gctwnl Jan 11 '23 at 08:45
  • @Mithical "Why is this AI so bad at math? Computers are generally supposed to be good at math." This is like asking: "If the matter in the human brain is so good at obeying the laws of physics, then why are some humans bad at physics?". The fact is that computers' basic computational ability isn't used to directly solve the math problems you ask ChatGPT; it is used to evaluate the transformer network & receive its outputs. But some people have had the idea to 'give transformers a calculator' (e.g. Meta's Toolformer) to make it better at arithmetic; good idea if you ask me. – profPlum May 07 '23 at 21:28
  • Note that ChatGPT has improved since the question was asked. Now it answers with " _If it takes 5 machines 5 minutes to make 5 devices, then each machine takes 5 minutes to make one device. Therefore, 100 machines can make 100 devices in the same amount of time it takes 1 machine to make 1 device, which is 5 minutes. So, it would take 100 machines 5 minutes to make 100 devices._ " – Dmitry Grigoryev Jun 16 '23 at 14:03
  • Highly outdated question, given that ChatGPT-4 gives the correct answer and in fact excels at math; see e.g. the benchmarks here: https://arxiv.org/abs/2306.08997. And the latest version can now also fire up the Wolfram Alpha plugin to solve maths problems when it needs to. Though for this simple problem it wouldn't use it - it would solve that just with its LLM logic. But I had cases where I gave a verbal problem statement, ChatGPT-4 wrote down the correct differential equation system for it & then solved it with Wolfram Alpha. It's pretty good at maths now! – Tom Wenseleers Jun 16 '23 at 17:01
  • This [report](https://12000.org/my_notes/CAS_integration_tests/reports/summer_2022/test_cases/11_MIT//report.htm) compares ChatGPT 3.5 against 8 other known computer algebra programs (such as Mathematica, Maple, etc.) on over 300 integration problems from MIT tests. As you can see from the results, ChatGPT scored badly at integration, solving only 14% of the problems, while the next one up (SymPy) scored 82%. Mathematica did best at 99.38%. – Nasser Jun 19 '23 at 13:59
  • I've run into this same issue: ChatGPT isn't really designed to do elementary-school and middle-school math homework problem solving. It keeps apologizing and looping the same bad responses in these scenarios. If you were asking me these questions, I would keep asking you more questions until I could comfortably answer you, or just give up. There's absolutely nothing simple about the "device" concept: it means totally different things to different people in the I.T. industry. You weren't asking it questions in the format of a math equation, which is something all computers do well. – thinksinbinary Jul 02 '23 at 16:07

4 Answers

27

ChatGPT is able to create well-formed sentences that contain phrases fitting the input. It has rules extracted from its data, but those are not rules of understanding; they are rules of 'likely phrases'.

In its simplest variant, every smartphone already has that mechanism. You type on your screen and the phone offers you the three words you would statistically use most often after the previous one. On the phone, this is not of very high quality. Keep choosing the suggested words for a message and the resulting sentence is something like this (I just started a message and kept picking a stream of 'middle options'):

"I am going on a trip with my friends to get to the full house for the weekend so we will need a time for a quick dinner with us at all and we can go over and there for the weekend if we want you guys out there"

You can see that many words have a bit of a statistical-likelihood relation to what comes just before them: 'time for', 'for a', 'time' -> 'quick', 'quick dinner', 'with us', 'with' -> 'at all'.

ChatGPT is fundamentally the same, but much better than what the phone can do; its statistics are much 'deeper'. It is the phone's autocompletion on steroids. But 'well-formed' and 'fitting' do not equate to 'reliable' or 'trustworthy'. More data and training will not help: the output will become even more 'well-formed' and 'fitting', but not more reliable.
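
Purely as an illustration (a toy bigram predictor I am making up here, nothing like ChatGPT's actual architecture), the phone-style mechanism looks roughly like this:

```python
# Toy "phone autocomplete": always pick the word that most often followed
# the previous word in some sample text. This is a deliberate caricature;
# ChatGPT's statistics are vastly deeper, but the objective is the same
# kind of "most likely continuation" prediction.
from collections import Counter, defaultdict

sample_text = ("i am going on a trip with my friends "
               "on a trip to the house with us for the weekend").split()

next_word_counts = defaultdict(Counter)
for prev, nxt in zip(sample_text, sample_text[1:]):
    next_word_counts[prev][nxt] += 1

def autocomplete(start, length=8):
    words = [start]
    for _ in range(length):
        options = next_word_counts.get(words[-1])
        if not options:
            break
        words.append(options.most_common(1)[0][0])  # most likely next word
    return " ".join(words)

print(autocomplete("i"))  # well-formed-ish output, with no understanding behind it
```

The output looks grammatical because likely continuations tend to be grammatical, not because anything understood the sentence.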

'Well-formed' and 'fitting' strongly suggest reliability and intelligence to humans, but that is more because we humans judge quickly and superficially. It's not so much that ChatGPT is intelligent, but more that we are less so than we think.

What is noteworthy is that many humans make the same 5/5/5 vs. 100/100/100 error. The difference is that you can explain to humans where their quick-and-dirty answer goes wrong. That is because they do have the understanding that ChatGPT lacks.

gctwnl
  • This isn't entirely true. ChatGPT can generate correct C code that compiles and runs, based on my questions. If it makes a mistake, I merely point out that a mistake was made by saying something abstract like "But then I have to know the type ahead of time," and it finds the mistake on its own, and offers a correction that will compile. It makes deductions that are far beyond "make plausible English sentences." – SO_fix_the_vote_sorting_bug Jan 10 '23 at 07:19
  • @SO_fix_the_vote_sorting_bug I believe that's because to ChatGPT, a C language question is still just a "language learning" question. It's still a souped-up phone prediction algorithm, just one that has seen a lot of C code and knows "what comes after this". Incidentally, ChatGPT is banned from Stack Overflow precisely because its answers often look superficially correct, but can be fundamentally wrong, because it doesn't actually have an understanding. Or a compiler to check itself with. – JamieB Mar 21 '23 at 18:59
  • What is the 5/5/5 100/100/100 error? – Janac Meena Jun 07 '23 at 17:31
  • @SO_fix_the_vote_sorting_bug In my experience, this is not true in general. Language models generate plausible code that is sometimes correct, sometimes wrong in subtle ways, and sometimes ridiculously wrong. I've had ChatGPT make blatant errors, and when corrected, apologize and confidently give a new answer, which is also totally wrong. – LarsH Jun 19 '23 at 10:49
  • Generally this is a good answer, but not "It's not so much that ChatGPT is intelligent, but more that we are less so than we think." Fitting answers suggest intelligence to humans because the two are strongly correlated in human speech and writing. We've not had experience before with entities that could produce fitting output without intelligence or understanding. – LarsH Jun 19 '23 at 10:52
10

(Check out my closely related answer to a similar question here.)

Why is ChatGPT bad at math, while it is very good at other stuff?

This comes down to the age-old problem of learning vs. understanding. On a high level, your question is very philosophical. The issue is that the model learns everything present in the data; however, throw it a slightly different mathematical problem it has not seen before, and it fails to answer spectacularly easy questions.

Some of the best examples are papers that try to build mathematics AIs. They feed the model an endless amount of data on simple addition, but give it slightly larger numbers it has never seen before, and it cannot solve the problem.

These AIs learn their data well, but they cannot extrapolate strict 'rules' from the data. Addition is a simple 'rule', but the model cannot comprehend this rule. Neural networks cannot learn rules simply because they are not designed to do so. They are multiplying and adding up inputs, weights, and biases, and they are not designed to learn 'if/else' logic.

An additional train of thought which I very much like is the following: think of how precise a super-small multi-layer perceptron would have to be to implement simple 2-digit addition (given 4 inputs). Should be doable, right? Now think of how you would have to learn the weights of a model with 1,000,000 parameters to do the exact same thing, using a large dataset and an NN optimizer. It would just start learning how to approximate the answers, but it would never converge to a state where it learns the rule of addition.
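
As a rough, hypothetical sketch of that thought experiment (using scikit-learn's MLPRegressor; the exact numbers will vary from run to run), you can train a small network on addition and then ask it about operands far outside its training range:

```python
# Hypothetical sketch: a small MLP trained on sums of numbers in 0..79.
# Inside that range it approximates well; far outside it, the saturating
# tanh units plateau, so the 'rule' of addition is clearly not what it learned.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
a = rng.integers(0, 80, size=20_000)
b = rng.integers(0, 80, size=20_000)
X_train = np.column_stack([a, b]) / 100.0   # scale inputs for the tanh units
y_train = a + b

model = MLPRegressor(hidden_layer_sizes=(64, 64), activation="tanh",
                     max_iter=3_000, random_state=0)
model.fit(X_train, y_train)

print(model.predict([[0.23, 0.41]]))  # 23 + 41: typically close to 64
print(model.predict([[3.12, 5.89]]))  # 312 + 589: nowhere near 901
```

Inside the training range the network approximates the sums reasonably well; outside it, the predictions simply plateau. It has fitted the data, not extracted the rule.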

Robin van Hoorn
4

ChatGPT is good at math and can understand the logic. It can derive new conclusions on its own and generate value that was not there before. To make use of ChatGPT, you have to provide it with the right context and questions. This applies to math but also to many other areas.

I started with the same task, 5 machines/5 minutes/5 devices. At first, it was able to provide the correct answer. Then I changed '5' to '3' (3 machines/3 minutes/3 devices) and it failed. But then I provided it with the context, asked the question differently, and it was able to understand the problem and even create a general rule for solving it.

Here is how (link to the chat)...

Initial question, answered correctly. Most likely, in the meantime, this was added to its training data; the original question was posted about a month ago.

B: If it takes 5 machines 5 minutes to make 5 devices, how long would it take 100 machines to make 100 devices?

ChatGPT: If it takes 5 machines 5 minutes to make 5 devices, it would take 1 machine 5 minutes to make 1 device.

If it takes 1 machine 5 minutes to make 1 device, it would take 100 machines 5 minutes to make 100 devices.

Then I changed the question in a trivial way (3 instead of 5). At first it failed to get it:

B: If it takes 3 machines 3 minutes to make 3 devices, how long would it take 100 machines to make 100 devices?

ChatGPT: It is not possible to determine how long it would take 100 machines to make 100 devices based on the information provided. The relationship between the number of machines and the time it takes to produce a certain number of devices is not necessarily linear. In order to determine how long it would take 100 machines to make 100 devices, we would need to know the efficiency of the machines and how they work together.

But then:

B: If it takes 5 machines 5 minutes to make 5 devices, how long would it take 100 machines to make 100 devices?

ChatGPT: If it takes 5 machines 5 minutes to make 5 devices, it would take 1 machine 5 minutes to make 1 device.

If it takes 1 machine 5 minutes to make 1 device, it would take 100 machines 5 minutes to make 100 devices.

B: using the same logic as in the previous answer please analyze similar problem and provide the answer.

If it takes 3 machines 3 minutes to make 3 devices, how long would it take 100 machines to make 100 devices?

ChatGPT: If it takes 3 machines 3 minutes to make 3 devices, it would take 1 machine 3 minutes to make 1 device.

If it takes 1 machine 3 minutes to make 1 device, it would take 100 machines 3 minutes to make 100 devices.

Then I went further and was able to get more from ChatGPT: it provided new value in the form of a general rule for this class of problems. Many humans would stop short at this step ;D ...

B: using the same logic, please formulate general rule to solve this type of a problem

ChatGPT: Here is a general rule that can be used to solve this type of problem:

Let X be the number of machines, Y be the number of minutes it takes for X machines to make X devices, and Z be the number of devices that need to be made.

To determine the number of minutes it would take Y machines to make Z devices, divide Z by X and multiply the result by Y.

For example, if X = 5, Y = 5, and Z = 100, the number of minutes it would take 5 machines to make 100 devices can be calculated as follows: (100/5) * 5 = 20 minutes.

This rule can be expressed in the following formula:

(Z/X) * Y = number of minutes it would take Y machines to make Z devices

I hope this helps! Let me know if you have any questions or if you would like further clarification on this rule.

Its final formula is correct, "(Z/X) * Y = number of minutes it would take Y machines to make Z devices".

But no one is perfect; it made two small mistakes there.

1st, "To determine the number of minutes it would take Y machines to make Z devices, divide Z by X and multiply the result by Y." It should have used another letter, like W, to avoid confusion.

Second, the example "(100/5) * 5 = 20 minutes" is just wrong; the result is 100 minutes.

Nevertheless, those small mistakes don't make any difference. It can now solve an advanced variant of the question (challenge any human with this :D ).

B: If it takes 3 machines 3 minutes to make 3 devices, how long would it take 7 machines to make 110 devices?

ChatGPT: If it takes 3 machines 3 minutes to make 3 devices, it would take 1 machine 3 minutes to make 1 device.

If it takes 1 machine 3 minutes to make 1 device, it would take 7 machines 3 minutes to make 7 devices.

To determine the number of minutes it would take 7 machines to make 110 devices, we can use the formula (Z/X) * Y = number of minutes it would take Y machines to make Z devices, where X is the number of machines, Y is the number of minutes it takes for X machines to make X devices, and Z is the number of devices that need to be made. Plugging in the values, we get:

(110/7) * 3 = 45 minutes

Therefore, it would take 7 machines 45 minutes to make 110 devices using the production system described.

Again, not perfect: it rounded 110/7 down to 15, whereas it should have kept the exact value of about 15.71 (which would give about 47 minutes rather than 45). But OK, it still makes sense. You can give it further instructions, e.g. not to round calculations for this type of problem, and it will comply going forward.
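
For reference, here is a quick check of those numbers in plain Python, assuming the intended rule is time = (devices needed × minutes per device) ÷ number of machines, and ignoring that devices are only finished in whole batches:

```python
def production_time(machines, devices, minutes_per_device):
    """Minutes for `machines` parallel machines, each taking `minutes_per_device`
    minutes per device, to produce `devices` devices (continuous approximation)."""
    return devices * minutes_per_device / machines

print(production_time(100, 100, 5))  # 5.0   -- the original riddle
print(production_time(5, 100, 5))    # 100.0 -- not 20, as in ChatGPT's example
print(production_time(7, 110, 3))    # ~47.1 -- not 45; 110/7 is ~15.71, not 15
```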

My take on this: we should all learn how to interact with this thing; it's going to be very important in the near future!

Boris
  • You need to repeat these experiments a *lot*. Remember, what you are getting are *samples* of what the model predicts the next bit of text will be. It does a *kind of* maths, up to some limit, *in order to predict the next text*: to represent many similar problem-and-answer examples in its training data, it discovered a procedural thing it could perform over timesteps that gave a better probability of producing the right text. But it also learned from examples like your second one that sometimes this is not possible. The output is stochastic, and 3 examples is not nearly enough. – Neil Slater Jan 06 '23 at 14:32
  • Based on my personal experience with ChatGPT Plus, this is not the issue. ChatGPT told me that 2 x 0.85 = 2.55. Then when I corrected it, ChatGPT recalculated it to 1.7, but it failed to calculate (1000 + 10) / (2 x 0.85) and provided a result of 588.24, while the real result should be 594.117647. So as for now, I consider ChatGPT not suitable for any calculations. – Ernest Sawyer Mar 05 '23 at 08:50
  • The optimism in this answer is heartening and reminds me of the early enthusiasm that welcomed the ELIZA chatbot in the sixties. If you really want to believe that ChatGPT can do math, then sure, ChatGPT can do math. It can also write your horoscope. – Stef Mar 31 '23 at 15:22
  • "ChatGPT is good at math and can understand the logic." This assertion is not supported by the data. You might as well say that a textbook on calculus can understand integrals. – LarsH Jun 19 '23 at 14:22
2

ChatGPT's GPT-4 model does not fall for this trap anymore due to more extensive training.

I tested with two prompts:

Prompt: If it takes 5 machines 5 minutes to make 5 devices, how long would it take 50 machines to make 50 devices?

GPT-4: This is an example of a classic problem that's meant to test understanding of rates. The key to solving it is recognizing that each machine can produce one device in five minutes.

Given that, if you have 50 machines working simultaneously, they can produce 50 devices in 5 minutes. This is because each machine is working independently to produce its own device, so the overall rate of production increases linearly with the number of machines. Thus, the answer is 5 minutes.

Prompt: And if it takes 3 machines 3 minutes to make 3 devices, how long would it take 70 machines to make 70 devices?

GPT-4: The principle is the same in this case. Each machine can produce one device in three minutes. So, if you have 70 machines working simultaneously, they can produce 70 devices in 3 minutes. This is because each machine is working independently to produce its own device, so the overall rate of production increases linearly with the number of machines. Thus, the answer is 3 minutes.

As I explained at the question How does ChatGPT know math?, the apparent ability of LLMs (in particular the most capable GPT-4 model) to solve certain mathematical problems is due to the amount of training and the number of parameters. These ML models do not have explicit rules for solving mathematical problems.

The following 2022 paper describes how such capabilities of transformer-based language models appear (i.e. emerge) once a certain threshold of parameter count and training compute (FLOPs) is exceeded: https://arxiv.org/pdf/2206.07682.pdf

LeRobert
  • Welcome to AI-SE. I'm not seeing how this late answer contributes to what was already stated in the other answers. Although GPT-4 might not fall for the same mistake, that is not the intent of the question, which is much more high-level. – Robin van Hoorn May 18 '23 at 09:41
  • @RobinvanHoorn The question was "Why is this AI (ChatGPT) so bad at math?" I added the current response of GPT-4, which shows progress on the specified case, while explaining that LLMs don't have mathematical reasoning rules; they were not built for such problems. What is very interesting from the article that I referenced is the fact that the LLM approach, given enough parameters, can even develop the ability to imitate mathematical reasoning in many cases. – LeRobert May 19 '23 at 15:41