Represent the maze as a sequence of movements which either continue or end up being forced to backtrack.
Basically it would represent the maze as a graph and do a depth-first search, keeping track of what nodes it as visited in its reasoning tokens.
See for example https://stackoverflow.com/questions/3097556/programming-theo... where the solution is represented as:
A B D (backtrack) E H L (backtrack) M * (backtrack) O (backtrack thrice) I (backtrack thrice) C F (backtrack) G J
In my opinion, being able to write the code to do the thing is effectively the same exact thing as doing the thing in terms of judging if its “able to do” that thing. Its functionality equivalent for evaluating what the “state of the art” is, and honestly is naive to what these models even are. If the model hid the tool calling in the background instead, and only showed you its answer would we say its more intelligent? Because that’s essentially how a lot of these things work already. Because again, the actual “model” is just a text autocomplete engine and it generates from left to right.
> Again, think about how the models work. They generate text sequentially.
You have some misconception on how these models work. Yes, the transformer LLMs generate output tokens sequentially, but it's weird you mention this because it has no relevance to anything. They see and process tokens in parallel, and then process across layers. You can prove, mathematically, that it is possible for a transformer-based LLM to perform any maze-solving algorithm natively (given sufficient model size and the right weights). It's absolutely possible for a transformer model to solve mazes without writing code. It could have a solution before it even outputs a single token.
Beyond that, Gemini 3 Pro is a reasoning model. It writes out pages of hidden tokens before outputting any text that you see. The response you actually see could have been the final results after it backtracked 17 times in its reasoning scratchpad.
That's great, but it's demonstrably false.
I can write code that calculates the average letter frequency across any Wikipedia article. I can't do that in my head without tools because of the rule of seven[1].
Tool use is absolutely an intelligence amplifier but it isn't the same thing.
> Because again, the actual “model” is just a text autocomplete engine and it generates from left to right.
This is technically true, but somewhat misleading. Humans speak "left to right" too. Specifically, LLMs do have some spatial reasoning ability (which is what you'd expect with RL training: otherwise they'd just predict the most popular token): https://snorkel.ai/blog/introducing-snorkelspatial/
[1] https://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus...