zlacker

[parent] [thread] 14 comments
1. JamesS+(OP)[view] [source] 2025-12-05 22:29:10
I just one-shot it with Claude Code (Opus 4.5) using this prompt. It took about 5 mins and included detecting that it was cheating at first (it drew a line around the boundary of the maze instead), so it added guardrails for that:

```

Create a devenv project that does the following:

  - Read the image at maze.jpg
  - Write a script that solves the maze  in the most optimal way between the mouse and the cheese
  - Generate a new image which is of the original maze, but with a red line that represents the calculated path
Use whatever lib/framework is most appropriate

```

  Output: https://gist.github.com/J-Swift/ceb1db348f46ba167948f734ff0fc604  
  Solution: https://imgur.com/a/bkJloPT
replies(3): >>esafak+t2 >>nl+Zw >>sebast+MR
2. esafak+t2[view] [source] 2025-12-05 22:42:50
>>JamesS+(OP)
If you allow tool use, much simpler models can solve it.
3. nl+Zw[view] [source] 2025-12-06 03:12:28
>>JamesS+(OP)
Programs can solve mazes and LLMs can program. That's a different thing completely.
replies(1): >>JamesS+Cz
4. JamesS+Cz[view] [source] [discussion] 2025-12-06 03:37:58
>>nl+Zw
That just seems like an arbitrary limitation. It's like asking someone to answer a math problem but with "no thinking allowed". Like, I guess we can gauge whether a model just _knows all knowable things in the universe_ using that method... but anything of any value that you're gauging in terms of 'intelligence' is actually going to be validating its ability to go "outside the scope" of what it actually is (an autocomplete on steroids).
replies(3): >>flying+GC >>nearbu+XL >>rglull+LQ
5. flying+GC[view] [source] [discussion] 2025-12-06 04:12:54
>>JamesS+Cz
We know there are very simple maze-solving algorithms you could code in a few lines of Python, but no one would claim those constitute intelligence. The difference is between applying intuitive logic and using a predetermined tool.
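
For example, something like this (a rough sketch: breadth-first search over a made-up grid maze, not anything from the linked gist):

```
from collections import deque

def solve_maze(grid, start, goal):
    """Shortest path through a grid of 0 (open) and 1 (wall) cells.

    Returns the path from start to goal as a list of (row, col)
    tuples, or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    came_from = {start: None}  # also doubles as the visited set
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the predecessor chain back to the start.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in came_from):
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

maze = [
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
print(solve_maze(maze, (0, 0), (3, 3)))
```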
6. nearbu+XL[view] [source] [discussion] 2025-12-06 06:30:44
>>JamesS+Cz
It depends whether you're asking it to solve a maze because you just need something that can solve mazes, or because you're trying to learn something about the model's abilities in different domains. If it can't solve a maze by inspection instead of writing a program to solve it, that tells you something about its visual reasoning abilities, and that can help you predict how it'll perform on other visual reasoning tasks that aren't easy to solve with code.
replies(2): >>seanmc+DM >>JamesS+4N
7. seanmc+DM[view] [source] [discussion] 2025-12-06 06:39:36
>>nearbu+XL
You could actually add mazes and paths through them to the training corpus, or make a model just for solving mazes. I wonder how effective it would be; I'm sure someone has tried it. I doubt it would generalize enough to give the AI new visual reasoning capabilities beyond just solving mazes.
8. JamesS+4N[view] [source] [discussion] 2025-12-06 06:46:24
>>nearbu+XL
Again, think about how the models work. They generate text sequentially. Think about how you solve a maze in your mind. Do you draw a line directly to the finish? No, it would be impossible to know the path until you had traced it, and by that point you have backtracked several times. So, what could a model _possibly_ do for this puzzle that counts as "fair game" as a valid solution, other than magically knowing the answer by pulling it out of thin air?
replies(2): >>nl+mh1 >>nearbu+gs2
9. rglull+LQ[view] [source] [discussion] 2025-12-06 07:56:20
>>JamesS+Cz
By your analogy, the developers of Stockfish are better chess players than any grandmaster.

Tool use can be a sign of intelligence, but "being able to use a tool to solve a problem" is not the same as "being intelligent enough to solve a specific class of problems".

replies(1): >>JamesS+Ur1
10. sebast+MR[view] [source] 2025-12-06 08:11:52
>>JamesS+(OP)
This (writing a program to solve the problem) would be a perfectly valid solution if the model had come up with it.

I participated in a "math" competition in high school which mostly tested logic and reasoning. The reason my team won by a landslide is because I showed up with a programmable calculator and knew how to turn the problems into a program that could solve them.

By prompting the model to create the program, you're taking away one of the critical reasoning steps needed to solve the problem.

11. nl+mh1[view] [source] [discussion] 2025-12-06 13:20:00
>>JamesS+4N
> So, what could a model _possibly_ be able to do for this puzzle which is "fair game" as a valid solution, other than magically know an answer by pulling it out of thin air?

Represent the maze as a sequence of movements which either continue or end up being forced to backtrack.

Basically it would represent the maze as a graph and do a depth-first search, keeping track of what nodes it has visited in its reasoning tokens.

See for example https://stackoverflow.com/questions/3097556/programming-theo... where the solution is represented as:

A B D (backtrack) E H L (backtrack) M * (backtrack) O (backtrack thrice) I (backtrack thrice) C F (backtrack) G J
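
A rough sketch of that idea (the graph here is hypothetical, loosely modelled on the linked example; the trace records each backtrack the way a model would have to narrate it in its reasoning tokens):

```
def dfs_trace(graph, start, goal):
    """Depth-first search that records every step, including the
    explicit '(backtrack)' moves, as a flat sequence of tokens."""
    trace, visited = [], set()

    def visit(node):
        visited.add(node)
        trace.append(node)
        if node == goal:
            return True
        for nxt in graph.get(node, []):
            if nxt not in visited:
                if visit(nxt):
                    return True
                trace.append("(backtrack)")
        return False

    return trace if visit(start) else None

# Hypothetical maze-as-graph: adjacency lists of reachable cells.
graph = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "E": ["H"],
    "C": ["F", "G"],
    "G": ["J"],
}
print(" ".join(dfs_trace(graph, "A", "J")))
```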

replies(1): >>JamesS+yn1
12. JamesS+yn1[view] [source] [discussion] 2025-12-06 14:16:12
>>nl+mh1
And my question to you is "why is that substantially different from writing the correct algorithm to do it?" I'm arguing it's a myopic view of what we're going to call "intelligence". And it ignores that human thought works the same way, using abstractions to move to the next level of reasoning.

In my opinion, being able to write the code to do the thing is effectively the exact same thing as doing the thing, in terms of judging whether it's "able to do" that thing. It's functionally equivalent for evaluating what the "state of the art" is, and honestly the alternative is naive about what these models even are. If the model hid the tool calling in the background and only showed you its answer, would we say it's more intelligent? Because that's essentially how a lot of these things work already. Because again, the actual "model" is just a text autocomplete engine, and it generates from left to right.

replies(1): >>nl+uQ2
13. JamesS+Ur1[view] [source] [discussion] 2025-12-06 14:56:24
>>rglull+LQ
I'm not talking about this being the "best maze solver" or "better at solving mazes than humans". I'm saying the model is "intelligent enough" to solve a maze.

And what I'm really saying is that we need to stop moving the goalposts on what "intelligence" is for these models, and start moving the goalposts on what "intelligence" actually _is_. The models are giving us an existential crisis over not only what it might mean to _be_ intelligent, but also how it might actually work in our own brains. I'm not saying the current models are Skynet, but I think there's going to be a lot learned by reverse engineering the current generation of models to really dig into how they encode things internally.

14. nearbu+gs2[view] [source] [discussion] 2025-12-06 23:41:03
>>JamesS+4N
First, the thrust of your argument is that you already knew that it would be impossible for a model like Gemini 3 Pro to solve a maze without code, so there's nothing interesting to learn from trying it. But the rest of us did not know this.

> Again, think about how the models work. They generate text sequentially.

You have some misconceptions about how these models work. Yes, transformer LLMs generate output tokens sequentially, but it's odd that you mention this, because it has no bearing here. They see and process the input tokens in parallel, and then again across layers. You can prove, mathematically, that it is possible for a transformer-based LLM to perform any maze-solving algorithm natively (given sufficient model size and the right weights). It's absolutely possible for a transformer model to solve mazes without writing code; it could have a solution before it even outputs a single token.

Beyond that, Gemini 3 Pro is a reasoning model. It writes out pages of hidden tokens before outputting any text that you see. The response you actually see could have been the final results after it backtracked 17 times in its reasoning scratchpad.

15. nl+uQ2[view] [source] [discussion] 2025-12-07 04:34:47
>>JamesS+yn1
> In my opinion, being able to write the code to do the thing is effectively the same exact thing as doing the thing

That's great, but it's demonstrably false.

I can write code that calculates the average letter frequency across any Wikipedia article. I can't do that in my head without tools because of the rule of seven[1].
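
For instance, something like this (a quick sketch; the sample string is made up):

```
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each ASCII letter in the text,
    case-insensitive; anything that isn't a letter is ignored."""
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    counts = Counter(letters)
    total = len(letters)
    return {ch: n / total for ch, n in counts.items()}

freqs = letter_frequencies("An example Wikipedia paragraph.")
print(max(freqs, key=freqs.get))  # the most common letter
```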

Tool use is absolutely an intelligence amplifier but it isn't the same thing.

> Because again, the actual “model” is just a text autocomplete engine and it generates from left to right.

This is technically true, but somewhat misleading. Humans speak "left to right" too. Specifically, LLMs do have some spatial reasoning ability (which is what you'd expect with RL training: otherwise they'd just predict the most popular token): https://snorkel.ai/blog/introducing-snorkelspatial/

[1] https://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus...
