The Illusion of Thinking: Strengths and limitations of reasoning models [pdf]

>>sneak+3P
Negative Negs spit out low effort snark, they said the same thing about solar, electric cars, even multicore, jit, open source. Thanks for refuting them, the forum software itself should either quarantine the response or auto respond before the comment is submitted. These people don't build the future.

Fusion News, May 28th, 2025 https://www.youtube.com/watch?v=1YHcI-SfKx8

>>sneak+3P
It isnt when you look at Q total. Total energy input for all needed support systems versus energy produced. See https://en.wikipedia.org/wiki/Fusion_energy_gain_factor for more details

>>amrrs+(OP)
> We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles.

It seems that AI LLMs/LRMs need helps from their distant cousins namely logic, optimization and constraint programming that can be attributed as intelligent automation or IA [1],[2],[3],[4].

[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:

https://www.youtube.com/live/TknN8fCQvRk

[2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT (2011) [video]:

https://youtube.com/watch?v=HB5TrK7A4pI

[3] Google OR-Tools:

https://developers.google.com/optimization

[4] MiniZinc:

https://www.minizinc.org/

>>imiric+N41
>There's nothing "omniscient" or "dim-witted" about these tools

I disagree in that that seems quite a good way of describing them. All language is a bit inexact.

Also I don't buy we are no closer to AI than ten years ago - there seem lots going on. Just because LLMs are limited doesn't mean we can't find or add other algorithms - I mean look at alphaevolve for example https://www.technologyreview.com/2025/05/14/1116438/google-d...

>found a faster way to solve matrix multiplications—a fundamental problem in computer science—beating a record that had stood for more than 50 years

I figure it's hard to argue that that is not at least somewhat intelligent?

>>ngneer+ZW
When people say LLMs can't do X, I like to try it.

    Q: Complete 3 by generating new knowledge:
    1. today is warm
    2. cats likes warm temperatures
    3.

A: Therefore, a cat is likely to be enjoying the weather today.

Q: does the operation to create new knowledge you did have a specific name?

A: ... Deductive Reasoning

Q: does the operation also have a Latin name?

A: ... So, to be precise, you used a syllogismus (syllogism) that takes the form of Modus Ponens to make a deductio (deduction).

https://aistudio.google.com/app/prompts/1LbEGRnzTyk-2IDdn53t...

People then say "of course it could do that, it just pattern matched a Logic text book. I meant in a real example, not an artificially constructed one like this one. In a complex scenario LLMs obviously can't do Modus Ponens.

>>thomas+kb1
>Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents

>In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task.

https://arxiv.org/abs/2311.13373

>>amrrs+(OP)
I think one of the reason we are confused about what LLMs can do is because they use language. And we look at the "reasoning traces" and the tokens there look human, but what is actually happening is very alien to us, as shown by "Biology of Large Language Models"[1] and "Safety Alignment Should Be Made More Than Just a Few Tokens Deep"[2]

I am struggling a lot to see what the tech can and can not do, particularly designing systems with them, and how to build systems where the whole is bigger than the sum of its parts. And I think this is because I am constantly confused by their capabilities, despite understanding their machinery and how they work, their use of language just seems like magic. I even wrote https://punkx.org/jackdoe/language.html just to remind myself how to think about it.

I think this kind of research is amazing and we have to spend tremendous more effort into understanding how to use the tokens and how to build with them.

[1]: https://transformer-circuits.pub/2025/attribution-graphs/bio...

[2]: https://arxiv.org/pdf/2406.05946

>>dleeft+po1
You know the meme "seems like will have AGI before we can reliably parse PDFs" :)

So if you are building a system, lets say you ask it to parse a pdf, and you put a judge to evaluate the quality of the output, and then you create a meta judge to improve the prompts of the parser and the pdf judge. The question is, is this going to get better as it is running, and even more, is it going to get better as the models are getting better?

You can build the same system in completely different way, more like 'program synthesis' imagine you dont use llms to parse, but you use them to write parser code, and tests, and then judge to judge the tests, or even escalate to human to verify, then you train your classifier that picks the parser. Now this system is much more likely to improve itself as it is running, and as the models are getting better.

Few months ago Yannic Kilcher gave this example as that it seems that current language models are very constrained mid-sentence, because they most importantly want produce semantically consistent and grammatically correct text, so the entropy mid sentence is very different than the entropy after punctuation. The . dot "frees" the distribution. What does that mean for "generalists" or "specialists" approach when sampling the wrong token can completely derail everything?

If you believe that the models will "think" then you should bet on the prompt and meta prompt approach, if you believe they will always be limited then you should build with program synthesis.

And, honestly, I am totally confused :) So this kind of research is incredibly useful to clear the mist. Also things like https://www.neuronpedia.org/

E.G. Why compliment (you can do this task), guilt (i will be fired if you don't do this task), and threatening (i will harm you if you don't do this task) work with different success rate? Sergey Brin said recently that threatening works best, I cant get my self to do it, so I take his word for it.

>>jbentl+7q1
A slightly less cynical take is that they want to temper expectations for the capabilities of LLMs in people’s day-to-day lives, specifically in the context of Apple products. A “smarter Siri” is never going to be an autonomous personal assistant à la Jarvis from Iron Man, which seems to be where a lot of investors think things are going. That tracks with this [0] preprint also released by Apple a few months ago.

A slightly more cynical take is that you’re absolutely correct, and making excuses for weak machine learning prowess has long been an Apple tenet. Recall that Apple never made privacy a core selling point until it was clear that Siri was years behind Google’s equivalent, which Apple then retroactively tried to justify by claiming “we keep your data private so we can’t train on it the way Google can.”

[0] https://arxiv.org/pdf/2410.05229

>>dmos62+3G1
For studying? Mainly watching and re-watching Karpathy's 'Zero To Hero'[1] and Stanford's 'Introduction to Convolutional Neural Networks for Visual Recognition'[2], also a lot of transformers from scratch videos like Umar Jamali's videos[3], and I also study backwards to McCulloch and Pitts. Reading the 30 papers https://punkx.org/jackdoe/30.html and so on.

And of course Yannic Kilcher[4], and also listening in on the paper discussions they do on discord.

Practicing a lot with just doing backpropagation by hand and making toy models by hand to get intuition for the signal flow, and building all kinds of smallish systems, e.g. how far can you push whisper, small qwen3, and kokoro to control your computer with voice?

People think that deepseek/mistral/meta etc are democratizing AI, but its actually Karpathy who teaches us :) so we can understand them and make our own.

[1] https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxb...

[2] https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3FW7Lu3i5...

[3] https://www.youtube.com/@umarjamilai

[4] https://www.youtube.com/@YannicKilcher

>>amrrs+(OP)
Not ashamed to admit I found the original paper daunting, so I made a top-down, Q&A-based mind map to help me understand it: https://app.gwriter.io/#/mindmap/view/2d128d6e-c3e8-4b99-8f4...

>>Too+Z51
> If not, what fundamental building blocks are missing to get there

If I were to guess, the missing building block is the ability to abstract - which is the ability to create a symbol to represent something. Concrete example of abstraction is seen in the axioms of lambda calculus. 1) ability to posit a variable, 2) ability to define a function using said variable, and 3) the ability to apply functions to things. Abstraction arises from a process in the brain which we have not understood yet and could be outside of computation as we know it per [1]

[1] https://www.amazon.com/Emperors-New-Mind-Concerning-Computer...

>>tim333+Sk1
> you have to judge intelligence by the results rather than the mechanism

This would be the exact opposite conclusion of the Chinese room: https://en.wikipedia.org/wiki/Chinese_room

I think you'd need to offer a stronger counter argument than the one you presented here.

>>ngneer+3Y1
> "Adam can lift 1000 pounds of steel. Can Adam lift 1000 pounds of feathers?"

Worked for me:

https://chatgpt.com/share/6844813a-6e4c-8006-b560-c0be223eeb...

gemma3-27b, a small model, had an interesting take:

> This is a classic trick question!

> While Adam can lift 1000 pounds, no, he likely cannot lift 1000 pounds of feathers.

> Volume: Feathers take up a huge amount of space for their weight. 1000 pounds of feathers would be an enormous volume – likely far too large for Adam to even get under, let alone lift. He'd be trying to lift a massive, bulky cloud.

> Practicality: Even if he could somehow get it under a barbell, the feathers would shift and compress, making a secure grip impossible.

> The question plays on our understanding of weight versus volume. It's designed to make you focus on the "1000 pounds" and forget about the practicalities of lifting something so voluminous.

Tried the counting question on the smallest model, gemma-3n-34b, it can run on a smartphone:

> Yes, if Adam can count to 14000, he can definitely count to 13500. Counting to a smaller number is a basic arithmetic operation. 13500 is less than 14000.

>>thomas+iz2
Towers of Hanoi IS an algorithmic problem. It is a high-school/college level problem when designing algorithms, probably kid level when trying to solve intuitively, heuristically or via brute force for few disks (i.e. like when playing Mass Effect 1 or similar games that embed it as a minigame*).

* https://www.youtube.com/watch?v=1vTBVyhX7n4

>>amrrs+(OP)
All "reasoning" models hit a complexity wall where they completely collapse to 0% accuracy.

No matter how much computing power you give them, they can't solve harder problems.

This research suggests we're not as close to AGI as the hype suggests.

Current "reasoning" breakthroughs may be hitting fundamental walls that can't be solved by just adding more data or compute.

Apple's researchers used controllable puzzle environments specifically because:

• They avoid data contamination • They require pure logical reasoning • They can scale complexity precisely • They reveal where models actually break

Models could handle 100+ moves in Tower of Hanoi puzzles but failed after just 4 moves in River Crossing puzzles.

This suggests they memorized Tower of Hanoi solutions during training but can't actually reason.

https://x.com/RubenHssd/status/1931389580105925115

>>amrrs+(OP)
Discussion of this article by Gary Marcus: https://garymarcus.substack.com/p/a-knockout-blow-for-llms

>>pegasu+jV1
It is hard to compare models with humans so not sure how to answer it for both. :)

But, for models, this is an interesting finding because a lot of LRMs are LLMs with a _bunch_ of post-training done on top. We know this about DeepSeek R1 (one of the models evaluated in the Apple paper) for sure. They write extensively about how they took DeepSeek-V3-Base and made R1 with it. [1]

If the post-training is resulting in lower performance on simpler tasks then it ought to inspire more research on how to make it so that it doesn't -- i.e., with more training (of any kind), we should be gaining more capabilities. This has been a problem with DNNs historically, btw. We had these issues when fine-tuning text/image classifiers as well. Some weight changes can be destructive. So, it has to be done with a _lot_ of care. And, I am sure folks are working on it, to be honest. Maybe some of them will say something here. :-)

[1] https://github.com/deepseek-ai/DeepSeek-R1

>>amrrs+(OP)
The system prompt in this experiment limits the solution to always spell out the concrete moves verbally. A human solving the Tower of Hanoi gives up around N=4 and goes off to invent a recursive solution instead. Prompted differently, the LLM would solve these puzzles just fine.

Here is my complete review/analysis of the paper: https://www.linkedin.com/pulse/art-abstraction-human-advanta...

edit: fixed typo

>>throwa+R92
We have to ditch language processing. And we will with online energy based models that machines boot from.

Maxwell could not get the theory of electromagnetism to work until he ditched pulleys and levers he’d included to describe the mechanics.

We won’t get AGI until we realize “there is no spoon” and language has nothing to do with our intelligence, just with out social tribalism: https://www.scientificamerican.com/article/you-dont-need-wor...

Take language out of the equation and drawing a circle, triangles, letters is just statistical physics. We can capture in energy models stored in an online state, statistical physics relative to the machine; its electromagnetic geometry: https://iopscience.iop.org/article/10.1088/1742-6596/2987/1/...

Our language doesn’t exist without humans. It’s not an immutable property of physics. It’s obfuscation and mind viruses. It’s story mode.

The computer acting as a web server or an LLM has an inherent energy model to it. New models of those patterns will be refined to a statefulness that strips away unnecessary language constructs in the system; like a lot of software most don’t use just developers.

I look forward to continuing my work in the hardware world to further compress and reduce the useless state of past systems of though we copy paste around to serve developers, to reduce context to sort through, and improve model quality: https://arxiv.org/abs/2309.10668

Single function factory hardware with embedded “prompt” that will boot from a model and the machines state will scaffold itself from there are coming: https://creativestrategies.com/jensen-were-with-you-but-were...

>>fc417f+qH2
That’s the point: AI is a marketing term and always has been. The underlying tech changes with every hype wave.

One of the first humanoid robots was an 18th century clockwork mechanism inside a porcelain doll that autonomously wrote out “Cogito Ergo Sum” in cursive with a pen. It was considered thought provoking at the time because it implied that some day machines could think.

BBC video posted to reddit 10 years ago: https://www.reddit.com/r/history/s/d6xTeqfKCv

>>curiou+eK
Using puzzles is not special or anything, it has been done a million times since before (and including) the LSTM paper (1997) https://www.bioinf.jku.at/publications/older/2604.pdf

>>amrrs+(OP)
One of the best analysis I found is this blog by Vishal Misra https://medium.com/@vishalmisra/the-illusion-of-thinking-why...

>>bwfan1+6M1
No. It's not microtubules. Enough with the g-darn microtubules already. https://www.biorxiv.org/content/10.1101/712794v1

"We used an antimicrotubular agent (parbendazole) and disrupted microtubular dynamics in paramecium to see if microtubules are an integral part of information storage and processing in paramecium’s learning process. We observed that a partial allosteric modulator of GABA (midazolam) could disrupt the learning process in paramecium, but the antimicrotubular agent could not. Therefore, our results suggest that microtubules are probably not vital for the learning behavior in P. caudatum. Consequently, our results call for a further revisitation of the microtubular information processing hypothesis."

>>danari+y85
>But LLMs are fundamentally, completely, incapable of X. It is not something that can be a result of their processes.

This is the very point of contention. You don't get to just assume it.

> it is because it is, at its root, a statistical engine generating plausible next tokens, with no semantic understanding of the underlying data.

Another highly contentious point you are just outright assuming. LLMs are modelling the world, not just "predicting the next token". Some examples here[1][2][3]. Anyone claiming otherwise at this point is not arguing in good faith. It's interesting how the people with the strongest opinions about LLMs don't seem to understand them.

[1] https://arxiv.org/abs/2405.15943

[2] https://x.com/OwainEvans_UK/status/1894436637054214509

[3] https://www.anthropic.com/research/tracing-thoughts-language...

>>amrrs+(OP)
This feels a bit like a weird way to test 'thinking' in models, and reminds me of the old story of Gauss[1] and his classmates being assigned the task of adding up the numbers from 1-100.

I think the way the paper lays out the performance regimes is pretty interesting, but I don't think they achieved their goal of demonstrating that LRMs can't use reasoning to solve complex puzzles organically (without contamination/memorization): IMO testing the model's ability to define an algorithm to solve the puzzle would have been a better evaluation of that (rather than having the model walk through all of the steps manually). I don't know that I'd use an LRM for this sort of long-tail reasoning where it has to follow one single process for a long time over just one prompt; if I needed a really long chain of reasoning I'd use an agent or workflow.

It sounds more like the tests measure a model's ability to reason coherently and consistently over many steps rather than a model's ability to understand and solve a complex puzzle. For example, for the Tower of Hanoi, a prompt like "Define an algorithm that will find the sequence of moves to transform the initial configuration into the goal configuration" (e.g. "find an arithmetic series formula, young Gauss") seems like it would have been a better approach than "Find the sequence of moves to transform the initial configuration into the goal configuration" (e.g. "add up all these numbers"). This is kind of seen in how the study included a step where the LRMs were given the algorithm and then asked to solve the problem, the focus was on an LRM's ability to follow the steps, not their ability to come up with an algorithm/solution on their own.

In a job interview, for example, who among us would accept inability to hold all of the `(2^n) - 1` steps of the Tower of Hanoi in our brain as evidence of poor reasoning ability?

Again, I think it's a really interesting study covering a model's ability to consistently follow a simple process over time in pursuit of a static objective (and perhaps a useful benchmark moving forward), but I'm not confident that it successfully demonstrates a meaninful deficiency in overall reasoning capability.

[1]: https://www.americanscientist.org/article/gausss-day-of-reck...

>>make3+Uy4
The Arc Prize just released a new update and it's all minigame puzzles

https://arcprize.org/

>>pothol+2V6
"bated breath", dammit!

- an old fisherman and aficionado of William Shakespeare.

https://www.vocabulary.com/articles/pardon-the-expression/ba...

FTFA: "Unless you've devoured several cans of sardines in the hopes that your fishy breath will lure a nice big trout out of the river, baited breath is incorrect."*

zlacker

The Illusion of Thinking: Strengths and limitations of reasoning models [pdf]