It is the first model to get partial credit on an LLM image test I have: counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.
In fact, GPT-5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug and adjusted the script's sensitivity so it only located 4, lol.
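For reference, a script like that would look roughly like the sketch below. This is not GPT-5's actual code, just an illustration of the approach: OpenCV color masks for fur and grass, intersect them where they touch, count the blobs. The HSV ranges, the dog.jpg filename, and the area threshold are all guesses.
```
import cv2
import numpy as np

img = cv2.imread("dog.jpg")                      # hypothetical filename
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Rough HSV ranges for "golden fur" and "bright green grass" (guesses).
fur = cv2.inRange(hsv, (10, 60, 80), (35, 255, 255))
grass = cv2.inRange(hsv, (36, 60, 60), (85, 255, 255))

# A leg candidate is fur that sits right next to grass: grow the grass mask
# with a tall kernel, then intersect it with the fur mask.
kernel = np.ones((15, 3), np.uint8)
contact = cv2.bitwise_and(fur, cv2.dilate(grass, kernel))

# Count connected blobs of contact pixels, ignoring tiny specks.
_, _, stats, _ = cv2.connectedComponentsWithStats(contact)
legs = sum(1 for s in stats[1:] if s[cv2.CC_STAT_AREA] > 200)
print(f"leg-like contact regions: {legs}")
```
That area threshold at the end is exactly the kind of knob that can be "adjusted" until the answer comes out to 4.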
Anyway, Gemini 3, while still being unable to count the legs on the first try, did identify "male anatomy" (its own words) also visible in the picture. The 5th leg was approximately where you could expect a well-endowed dog to have a "5th leg".
That aside though, I still wouldn't call it particularly impressive.
As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but interesting that it could properly interpret "dog leg" and ID them. Also, the dogs with many legs (I have a few of them) all had their extra legs added by nano-banana.
Here’s how Nano Banana fared: https://x.com/danielvaughn/status/1971640520176029704?s=46
```
Create a devenv project that does the following:
- Read the image at maze.jpg
- Write a script that solves the maze in the most optimal way between the mouse and the cheese
- Generate a new image which is of the original maze, but with a red line that represents the calculated path
Use whatever lib/framework is most appropriate
```
Output: https://gist.github.com/J-Swift/ceb1db348f46ba167948f734ff0fc604
Solution: https://imgur.com/a/bkJloPT
Represent the maze as a sequence of movements which either continue or end up being forced to backtrack.
Basically it would represent the maze as a graph and do a depth-first search, keeping track of what nodes it has visited in its reasoning tokens.
See for example https://stackoverflow.com/questions/3097556/programming-theo... where the solution is represented as:
A B D (backtrack) E H L (backtrack) M * (backtrack) O (backtrack thrice) I (backtrack thrice) C F (backtrack) G J
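To make that concrete, the scripted version of the original task looks roughly like the sketch below: turn the image into a walkable grid, run a search while tracking visited cells, then paint the path back onto the image in red. This is my own sketch under assumptions (Pillow, dark pixels as walls, hard-coded start/goal coordinates), not the code from the gist above.
```
from PIL import Image, ImageDraw

img = Image.open("maze.jpg").convert("L")
w, h = img.size
wall = lambda x, y: img.getpixel((x, y)) < 128   # dark pixel = wall (assumption)

start, goal = (5, 5), (w - 6, h - 6)             # placeholder mouse/cheese coordinates

# Iterative DFS with an explicit visited set -- the same bookkeeping the
# comment above describes the model doing in its reasoning tokens.
# (Swapping the stack for a queue turns this into BFS, which gives the
# shortest path the prompt actually asks for.)
stack, parent, visited = [start], {start: None}, {start}
while stack:
    cur = stack.pop()
    if cur == goal:
        break
    x, y = cur
    for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        nx, ny = nxt
        if 0 <= nx < w and 0 <= ny < h and nxt not in visited and not wall(nx, ny):
            visited.add(nxt)
            parent[nxt] = cur
            stack.append(nxt)

# Walk the parent links back from the goal and overlay the path in red.
path, node = [], goal
while node is not None:
    path.append(node)
    node = parent.get(node)
out = img.convert("RGB")
ImageDraw.Draw(out).point(path, fill=(255, 0, 0))
out.save("maze_solved.png")
```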
In my opinion, being able to write the code to do the thing is effectively the same as doing the thing, in terms of judging whether it's “able to do” that thing. It's functionally equivalent for evaluating what the “state of the art” is, and honestly it's naive to what these models even are. If the model hid the tool calling in the background and only showed you its answer, would we say it's more intelligent? Because that's essentially how a lot of these things work already. Because again, the actual “model” is just a text autocomplete engine and it generates from left to right.
That's great, but it's demonstrably false.
I can write code that calculates the average letter frequency across any Wikipedia article. I can't do that in my head without tools because of the rule of seven[1].
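For example, something like this does it in a dozen lines (using Wikipedia's REST summary endpoint here just to keep it short; any way of fetching the full article text works the same):
```
from collections import Counter
import requests

# Letter frequencies for a Wikipedia article -- trivial with code,
# not something you can do in your head.
title = "Dog"
resp = requests.get(f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}")
text = resp.json()["extract"].lower()

letters = Counter(c for c in text if c.isalpha())
total = sum(letters.values())
for letter, count in letters.most_common():
    print(f"{letter}: {count / total:.3%}")
```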
Tool use is absolutely an intelligence amplifier but it isn't the same thing.
> Because again, the actual “model” is just a text autocomplete engine and it generates from left to right.
This is technically true, but somewhat misleading. Humans speak "left to right" too. Specifically, LLMs do have some spatial reasoning ability (which is what you'd expect with RL training: otherwise they'd just predict the most popular token): https://snorkel.ai/blog/introducing-snorkelspatial/
[1] https://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus...