No, LLMs Still Cannot Reason - Part II
The follow-up you didn't know you needed—and neither did I...
You are reading Mostly Harmless AI, a section of Mostly Harmless Ideas where we explore the capabilities and limitations of Artificial Intelligence systems. Subscribe for free to get all future updates in your inbox.
This article is part of my upcoming book How to Train your Chatbot, a handbook for understanding LLMs and using them effectivley to build all sorts of cool stuff. You can get it today in early access.
In a previous article, I claimed LLMs cannot reason, and the Internet exploded. I was quickly reminded why I left Twitter early this year. Anyway, I received a lot of positive feedback and some poignant criticism. There were also many well-motivated but ultimately flawed or irrelevant counterarguments and some pretty good arguments that showed a lack of clarity in my explanation.
So, in the spirit of intellectual honesty, I want to write this short follow-up article to address the most relevant counterarguments and reiterate why, despite these, I still stand behind the basic claim that large language models, to this day, cannot truly reason. And so should you.
If you had some doubts about my previous article covering all the grounds, I hope this article clarifies the arguments and rounds up the picture for you. But if you are already convinced of this claim, I still urge you to read this article because it may give you extra ammo to defend against some of the most common—and misplaced—counterarguments.
But first, let me get something straight. Whenever you argue about what LLMs can or cannot do online, you’ll get a flurry of LLM zealots who cannot, for the love of Turing, see past their infatuation with their latest toy. More often than not, the same people that two years ago wanted to fund their own private virtual country running entirely on memecoins.
This article is not for them. It is for you, the sensible, rational reader, who may or may not agree with me, but have, like me, the goal of broadening your understanding and reaching a common ground.
If you’ve read any article on this blog, you probably already know I’m a huge fan of Artificial Intelligence. I’m also a full-time researcher and scholar in this area. For this reason, I approach this topic with an absolute commitment to uncovering the truth about AI’s capabilities and limitations. While LLMs are definitely groundbreaking, they are not perfect. It is only by reasonable criticism that we can make them better.
Sadly, some prominent figures in the AI community also seem infatuated with this technology and are either ignorant of their limitations or simply hypocrites who want to sell you the last coke in the desert. And they steer the online narrative. But blindly believing in angel horns can lead to an implicit acceptance of their intrinsic limitations rather than a critical assessment of where, when, and how LLMs should and, more importantly, should not be deployed.
Like all my previous articles, this one is not an attack on those who champion LLMs—I’m one of them! Instead, it invites open dialogue among readers eager to expand their understanding. Regardless of your feelings about AI, I aim to convince you that this technology has some fundamental limitations we need to address before letting it loose. There is a lot of work in that direction, but it is still insufficient. If this makes at least one of you interested in pursuing a career in making LLMs—and AI systems in general—more robust, trustworthy, and reliable, then we’ve all won!
Phew! That was a rant! With all of that out of my system, let’s get to the important part. I want to address three common misconceptions or fallacies: 1) the fallacy in comparing AI capabilities to humans, 2) a somewhat nuanced misconception about the role of randomness in AI, and 3) a bunch of related misconceptions on how easy would be to make LLMs Turing complete.
But first, let’s formalize what we mean when we say LLMs cannot reason.
What is reasoning (in AI)?
When we AI folks claim LLMs cannot reason, we are not talking about any abstract, philosophical sense of the word “reason”, nor any of the many psychological and sociological nuances it may entail. No, we have a very specific, quantifiable, simplified notion of reasoning that comes straight out of math.
Reasoning is, simply put, the capacity to draw logically sound conclusions from a given premise. In math, there are two main reasoning types or modes: deduction and induction. Induction is somewhat problematic because it involves generalizing claims from specific instances, and thus, it requires some pretty strong assumptions. In contrast, deduction is very straightforward. It is about applying a finite set of logical inference rules to obtain new provably true claims from existing true claims. It is the type of reasoning that mathematicians do all day long when proving new theorems.
Thus, when I say LLMs cannot reason, I’m simply saying there are—sometimes pretty simple—deduction problems they inherently cannot solve. It is not a value judgement, or an opinion based on experience. It is a straightforward claim provable from the definition of reasoning—understood as deductive reasoning—and the inherent limitations of LLMs given their architecture and functionality.
If this is clear, let’s move on to the counterarguments to this claim.
Argument 1: Humans Also Have these Limitations
The most common criticism I received against the assertion that LLMs cannot reason is that, sure, LLMs cannot reason, but neither can humans, right? I mean, humans can be stupendously irrational. But this argument is flawed on many levels, so let’s unpack it.
First, while it is true that humans can make errors in reasoning, the human brain definitely possesses the capacity for open-ended reasoning, as evidenced by the more than 2000 years of solid math we have collectively built. Moreover, all college students—at least in quant fields—at some point have to solve structured problem-solving exercises that require them to apply logical reasoning to arrive at correct conclusions, such as proving theorems. So, while humans can be pretty stupid at times, we are certainly capable of the most rigorous reasoning when trained to do so.
But even more importantly, this assertion is a red herring. Why the fact humans can’t do something immediately makes it ok for a piece of technology to suck at it? Imagine we did this with all our other tech. Sure, that airplane fell down and killed 300 people, but humans can’t fly, so there’s that. Or yes, that submarine imploded, but humans can’t breathe underwater. Or that nuclear power plant melted, but humans can’t stand 3000 degrees of heat, so what’s the big deal?
No, we don’t do that. We compare any new piece of technology with our current best solution, and only if the new thing improves upon the old—at least on some metrics—do we consider it worthwhile.
Granted, we often compare AI capabilities to human capabilities, but this is only because humans are the gold standard for the types of problems we often want AI systems to solve. So we compare LLM’s capacity to generate creative stories with our best writers, and we compare LLMs’ capacity for open-ended dialogue or for emphatic customer assistance with humans because there is nothing out there better than humans at these tasks.
However, there are well-established systems—such as traditional SAT solvers—that excel in structured logical deduction and reasoning tasks. These systems are designed with rigorous validation mechanisms that ensure correctness and reliability in their outputs. They are basically flawless and incredibly fast. So, instead of comparing LLMs to humans in deductive reasoning, let’s compare them with the best solution we currently have for this problem. And there, LLMs definitely suck.
Argument 2: Randomness is Not a Limitation
The second most common criticism I received was regarding the stochastic nature of language models. To recap, I claim that since LLMs generate tokens in a probabilistic fashion—which is a fundamental feature of the paradigm—, their output is inherently unreliable when you require absolute accuracy instead of versatility.
A lot of people correctly argued that, in fact, randomness is essential in problem-solving and a crucial feature of many of the same SAT solvers against I pretend to compare LLMs. How hypocritical of me, they claim, to posit randomness as a limitation when the most effective deductive reasoning algorithms we have are essentially random. And this is true, but only partially, and it makes all the difference. So let me explain.
Randomness plays a vital role in many computational problem-solving techniques, particularly in search algorithms for hard (read NP-complete or NP-hard) problems. Modern SAT solvers, for example, often employ randomized search strategies to efficiently explore vast solution spaces. By introducing randomness into the search process, these solvers can escape local optima and discover satisfactory solutions more quickly than deterministic methods might allow. This ability to leverage randomness is a powerful tool in the arsenal of computational techniques, enabling systems to tackle complex problems that would otherwise be intractable.
However—and here comes the crucial difference—using randomness in the search process does not imply that the entire reasoning process is inherently unreliable. Randomness is confined to the search phase of problem-solving, where it helps identify potential solutions—potential reasoning paths. However, once a candidate solution is found, a deterministic validation phase kicks in that rigorously checks the correctness of the proposed reasoning path.
The distinction between the search and validation phases is paramount in understanding how randomness contributes to effective problem-solving in general. During the search phase, algorithms may employ random sampling or other stochastic methods to explore possibilities and generate potential solutions. This phase allows for flexibility and adaptability, enabling systems to navigate complex landscapes of potential answers.
However, once a potential solution has been identified, it must undergo a validation process that is grounded in deterministic logic. This validation phase involves applying established rules and principles to confirm that the proposed solution meets all necessary criteria for correctness. As a result, any solution that passes this validation step can be confidently accepted as valid, regardless of how it was generated in the first place.
You can have millions of monkeys typing in a typewriter, and at some point, one of them will randomly produce Romeo and Juliet, but only Shakespeare can filter the garbage from the gold and decide which pamphlet to publish.
That silly metaphor means that randomness is good for exploring hypotheses but not for deciding which one to accept. For that, you need a deterministic, provably correct method that doesn’t rely on probabilities—at least if you want to solve the problem exactly.
However, in stark contrast to traditional problem-solving systems like SAT solvers, LLMs lack a robust validation mechanism. While they can generate coherent and contextually relevant responses based on probabilistic reasoning, some of which may be correct reasoning chains, they do not possess a reliable method for verifying the accuracy of those outputs. The verification process is also stochastic and subject to hallucinations, rendering it utterly unreliable.
So, since LLMs evaluate their own outputs using the same probabilistic reasoning they employ for generating them in the first place, there is an unavoidable risk that incorrect conclusions will be propagated as valid responses. The monkeys are the also the editors.
Argument 3: LLMs Can Be Turing-Complete
The final argument I want to address is the notion that LLMs can be made Turing-complete by duct-taping them with some Turing-complete gadget. Here’s a brief recap of what this means.
LLMs have a fixed computational budget—a fixed number of matrix multiplications they perform per input token. This means there are problems that are inherently outside the realm of what they can solve. These problems fall into two categories.
First, NP-Complete problems—such as the very straightforward problem of determining whether a logical formula is valid—are a class of decision problems for which no known polynomial-time solutions exist. Moreover, most experts believe no such algorithm can exist. Thus, these problems probably require an exponential amount of computation for sufficiently large instances. Thus, given the fixed computational budget of LLMs, no matter how big your stochastic parrot, there will always be a logical formula that is simply to large for it to solve.
On the other hand, we have semi-decidable problems, those for which an algorithm can confirm a solution if one exists but may run indefinitely if no solution is found. For these problems, we simply have no option but to keep searching for a potentially unbounded amount of time. And since LLMs are computationally bounded, there are solvable problem instances that simply would require more computing steps than the LLM can produce.
Now, all of the above is clear to anyone who even superficially understands how LLMs work. However, a common argument posited by critics is that LLMs can be rendered Turing complete by integrating them with external tools, such as code generators or general-purpose inference engines, or even easier, let’s wrap it in a recursive procedure that can simply call the LLM as many times as necessary.
And this is true. You can trivially make an LLM Turing-complete, in principle, by duct-taping it with something that is already Turing-complete. You can also build a flame thrower with a bamboo stick, some duct tape, and a fully working flame thrower.
However, simply making LLMs Turing complete in principle does not guarantee that they will produce correct or reliable outputs. The integration of external tools introduces complexity and potential points of failure, particularly if the LLM does not effectively manage interactions with these tools.
The problem is, when you combine stochastic output—prone to hallucinations—with external tools that require precise inputs, you get LLMs that, in principle, have access to all the resources they may need but are incapable of using them reliably.
When relying on external systems for reasoning tasks—for example, having your LLM call a SAT solver when necessary—it is crucial that LLMs can consistently identify the appropriate tool to use and provide it with the correct arguments. However, due to their probabilistic nature and susceptibility to hallucinations, LLMs struggle to do so reliably. And even if they successfully invoke an external tool, there is no guarantee that they will interpret or apply the tool’s output correctly in their reasoning process.
So, Turing-incompleteness or bounded computation may not be a knockout argument on its own, but when combined with the other inherent limitations of LLMs—crucially, their unreliability—it is clear there are no guarantees even the most advanced models won’t fail to solve some reasoning task.
And here is the final kicker: approximate reasoning is not good enough. If the LLM fails one out of every million times to produce the right deduction, that still means the LLM cannot reason. For all practical purposes, you may be happy with a model that gets it right 9 out of 10 or 99 out of 100, but in mission-critical tasks, nothing short of perfect reasoning is good enough.
And that’s the claim: LLMs are incapable, by design, of perfect reasoning.
Conclusion
The purpose of this and the previous article is to convince you of two claims:
Large Language Models currently lack the capability to perform a well-defined form of reasoning that is essential for many decision-making processes.
We currently have absolutely no idea how to solve this in the near future.
This matters because there is a growing trend to promote LLMs as general-purpose reasoning engines. As more users begin to rely on LLMs for important decisions, the implications of their limitations become increasingly significant. At some point, someone will trust an LLM with a life-and-death decision, with catastrophic consequences.
More importantly, the primary challenges in making LLMs trustworthy for reasoning are immense. Despite ongoing research and experimentation, we have yet to discover solutions that effectively bridge the gap between LLM capabilities and the rigorous standards required for reliable reasoning. Currently, our best efforts in this area are nothing but duct tape—temporary fixes that do not address the underlying limitations of the stochastic language modeling paradigm.
Now, I want to stress that these limitations do not diminish the many other applications where LLMs excel as stochastic language generators. In creative writing, question answering, user assistance, translation, summarization, automatic documentation, and even coding, many of the limitations we have discussed here are actually features.
The thing is, this is what language models were designed for—to generate plausible, human-like, varied, not-necessarily-super-accurate language. The whole paradigm of stochastic language modeling is optimized for this task, and it excels at it. It is much better than anything else we’ve ever designed. But when we ask LLMs to step outside that range of tasks, they become brittle, unreliable, and, worse, opaquely so.
If LLMs are to fulfill even some of our highly unrealistic expectations for them, we must prioritize solving the challenge of provably correct reasoning. Until then, all we have is a stochastic parrot—a fun toy with some interesting use cases but not a truly transformative technology.
Thanks for reading! Remember you can get my book on LLMs at half the publication price while its on early access.