Gemini 3 Pro: the frontier of vision AI

>>xnx+(OP)
Well

It is the first model to get partial credit on an LLM image test I have: counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.

In fact, GPT-5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug and adjusted the script's sensitivity so it only located 4, lol.

Anyway, Gemini 3, while still unable to count the legs on the first try, did identify "male anatomy" (its own words) also visible in the picture. The 5th leg was approximately where you could expect a well-endowed dog to have a "5th leg".

That aside though, I still wouldn't call it particularly impressive.

As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but interesting that it could properly interpret "dog leg" and ID them. Also, the dogs with many legs (I have a few of them) all had their extra legs added by nano-banana.

>>Workac+cU
I don’t know much about AI, but I have this image test that everything has failed at. You basically just present an image of a maze and ask the LLM to draw a line along the optimal path.

Here’s how Nano Banana fared: https://x.com/danielvaughn/status/1971640520176029704?s=46

>>daniel+dY
I just oneshot it with claude code (opus 4.5) using this prompt. It took about 5 mins and included detecting that it was cheating at first (drew a line around the boundary of the maze instead), so it added guardrails for that:

```

Create a devenv project that does the following:

  - Read the image at maze.jpg
  - Write a script that solves the maze  in the most optimal way between the mouse and the cheese
  - Generate a new image which is of the original maze, but with a red line that represents the calculated path

Use whatever lib/framework is most appropriate

```

  Output: https://gist.github.com/J-Swift/ceb1db348f46ba167948f734ff0fc604  
  Solution: https://imgur.com/a/bkJloPT
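
For readers who want a feel for what "a script that solves the maze" involves, here is a minimal sketch using breadth-first search, which finds a shortest path on an unweighted grid. It is not the code from the gist above. It assumes the image has already been converted to a text grid (`#` for walls, `S` for the mouse, `G` for the cheese), and the names `MAZE`, `find`, and `solve_maze` are made up for illustration.

```python
from collections import deque

# Hypothetical maze: '#' is a wall, 'S' the start (mouse), 'G' the goal (cheese).
MAZE = [
    "#######",
    "#S..#.#",
    "#.#.#.#",
    "#.#...#",
    "#.###.#",
    "#....G#",
    "#######",
]


def find(grid, char):
    """Return the (row, col) of the first cell containing char."""
    for r, row in enumerate(grid):
        c = row.find(char)
        if c != -1:
            return (r, c)
    raise ValueError(f"{char!r} not found in grid")


def solve_maze(grid, start, goal):
    """Breadth-first search; returns a shortest path as a list of cells, or None."""
    rows, cols = len(grid), len(grid[0])
    prev = {start: None}  # maps each reached cell to the cell it was reached from
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the predecessor links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in prev:
                prev[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal is unreachable


path = solve_maze(MAZE, find(MAZE, "S"), find(MAZE, "G"))
print(path)
```

Because the search only ever steps onto open cells inside the grid, a path that runs around the outside boundary is not possible here. The hard part in the image version is the step this sketch skips: turning pixels into a reliable grid.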

>>JamesS+hh1
Programs can solve mazes and LLMs can program. That's a different thing completely.

>>nl+gO1
That just seems like an arbitrary limitation. It's like asking someone to do a math calculation but "no thinking allowed". Like, I guess we can gauge if a model just _knows all knowable things in the universe_ using that method... but anything of any value that you are gauging in terms of 'intelligence' is going to actually be validating their ability to go "outside the scope" of what they actually are (an autocomplete on steroids).

>>JamesS+TQ1
It depends whether you're asking it to solve a maze because you just need something that can solve mazes, or if you're trying to learn something about the model's abilities in different domains. If it can't solve a maze by inspection instead of writing a program to solve it, that tells you something about its visual reasoning abilities, and that can help you predict how it'll perform on other visual reasoning tasks that aren't easy to solve with code.

>>nearbu+e32
Again, think about how the models work. They generate text sequentially. Think about how you solve the maze in your mind. Do you draw a line direct to the finish? No, it would be impossible to know what the path was until you had done it. But at that point you have now backtracked several times. So, what could a model _possibly_ be able to do for this puzzle which is "fair game" as a valid solution, other than magically know an answer by pulling it out of thin air?
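
The "try a route, hit a dead end, back up" process described above can be written down as a depth-first search with an explicit backtrack counter. This is only a sketch of that mental procedure on a made-up text grid, not a claim about what any model does internally. `MAZE` and `dfs_with_backtracking` are hypothetical names, and the fixed up/right/down/left order is an arbitrary choice.

```python
# Hypothetical maze: '#' is a wall; start and goal are given as coordinates.
MAZE = [
    "#######",
    "#...#.#",
    "#.#.#.#",
    "#.#...#",
    "#.###.#",
    "#.....#",
    "#######",
]


def dfs_with_backtracking(grid, start, goal):
    """Depth-first search; returns (path, backtracks), path is None if unsolved."""
    rows, cols = len(grid), len(grid[0])
    path = [start]       # current route, from start to the cell being explored
    visited = {start}    # cells already tried, so we never re-enter them
    backtracks = 0
    while path:
        r, c = path[-1]
        if (r, c) == goal:
            return path, backtracks
        # Try the neighbours in a fixed order: up, right, down, left.
        for nr, nc in ((r - 1, c), (r, c + 1), (r + 1, c), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in visited:
                visited.add((nr, nc))
                path.append((nr, nc))
                break
        else:
            # No unvisited open neighbour: dead end, so step back one cell.
            path.pop()
            backtracks += 1
    return None, backtracks


route, backtracks = dfs_with_backtracking(MAZE, (1, 1), (5, 5))
print(route, backtracks)
```

The route it returns is a valid path but not necessarily the shortest one. The point is only that the final answer hides the dead ends that were explored along the way.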

>>JamesS+l42
First, the thrust of your argument is that you already knew that it would be impossible for a model like Gemini 3 Pro to solve a maze without code, so there's nothing interesting to learn from trying it. But the rest of us did not know this.

> Again, think about how the models work. They generate text sequentially.

You have some misconception about how these models work. Yes, the transformer LLMs generate output tokens sequentially, but it's weird you mention this because it has no relevance to anything. They see and process tokens in parallel, and then process across layers. You can prove, mathematically, that it is possible for a transformer-based LLM to perform any maze-solving algorithm natively (given sufficient model size and the right weights). It's absolutely possible for a transformer model to solve mazes without writing code. It could have a solution before it even outputs a single token.
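
To make "process in parallel, then across layers" a bit more concrete, here is a sketch of a maze algorithm with that shape: every cell updates its distance-to-goal at the same time, using only the previous round's values, and the rounds are stacked one after another. This is an analogy, not a transformer and not a proof. The grid, the name `parallel_distances`, and the mapping of "round" to "layer" are all assumptions made for illustration.

```python
INF = float("inf")

# Hypothetical maze: '#' is a wall; the goal is given as a coordinate.
MAZE = [
    "#######",
    "#...#.#",
    "#.#.#.#",
    "#.#...#",
    "#.###.#",
    "#.....#",
    "#######",
]


def parallel_distances(grid, goal):
    """Synchronous wavefront: returns (distance table, rounds until stable)."""
    rows, cols = len(grid), len(grid[0])
    dist = [[INF] * cols for _ in range(rows)]
    dist[goal[0]][goal[1]] = 0
    rounds = 0
    while True:
        # Each round reads only the old table, so every cell's update is
        # independent of the others and could be computed simultaneously.
        new = [row[:] for row in dist]
        for r in range(rows):
            for c in range(cols):
                if grid[r][c] == "#":
                    continue
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and grid[nr][nc] != "#":
                        new[r][c] = min(new[r][c], dist[nr][nc] + 1)
        if new == dist:
            return dist, rounds  # nothing changed: the table is stable
        dist = new
        rounds += 1


dist, rounds = parallel_distances(MAZE, (5, 5))
print(dist[1][1], rounds)
```

Once the table is stable, a shortest path falls out by stepping from the start to any neighbour with a smaller distance. The number of rounds needed grows with the longest shortest path in the maze, which is one way to read the "given sufficient model size" caveat.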

Beyond that, Gemini 3 Pro is a reasoning model. It writes out pages of hidden tokens before outputting any text that you see. The response you actually see could have been the final results after it backtracked 17 times in its reasoning scratchpad.
