Gemini 3 Pro: the frontier of vision AI

>>xnx+(OP)
Well

It is the first model to get partial credit on an LLM image test I have, which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.

In fact, GPT-5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug and adjusted the script's sensitivity so it only located 4, lol.
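For anyone curious what that kind of check might look like, here is a toy sketch. It is not the script GPT-5 wrote (I don't have that); the colour thresholds, the `min_width` knob, and the synthetic pixel row are all made up for illustration. It counts runs of "golden" pixels along a single row just above the grass line, and shows how raising the sensitivity threshold can make a narrow fifth leg disappear from the count.

```python
def is_golden(pixel):
    """Rough colour test for 'golden fur' (hypothetical thresholds)."""
    r, g, b = pixel
    return r > 150 and g > 100 and b < 120 and r > g


def count_leg_contacts(row, min_width=1):
    """Count contiguous runs of golden pixels in one image row.

    min_width is the sensitivity knob: runs narrower than this are ignored.
    """
    count = 0
    run = 0
    for pixel in row:
        if is_golden(pixel):
            run += 1
        else:
            if run >= min_width:
                count += 1
            run = 0
    if run >= min_width:  # close a run that touches the right edge
        count += 1
    return count


GRASS = (60, 200, 60)   # assumed "bright green grass" colour
FUR = (210, 170, 80)    # assumed "golden dog feet" colour

# Synthetic row just above the grass line: four wide legs and one narrow one.
row = (
    [GRASS] * 3 + [FUR] * 4 + [GRASS] * 3 + [FUR] * 4 + [GRASS] * 2
    + [FUR] * 2 + [GRASS] * 2 + [FUR] * 4 + [GRASS] * 3 + [FUR] * 4 + [GRASS] * 3
)

print(count_leg_contacts(row, min_width=1))  # 5
print(count_leg_contacts(row, min_width=3))  # 4
```

Same pixels both times; only the threshold changed. Which is more or less the "it was a bug" move.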

Anyway, Gemini 3, while still being unable to count the legs first try, did identify "male anatomy" (its own words) also visible in the picture. The 5th leg was approximately where you could expect a well-endowed dog to have a "5th leg".

That aside though, I still wouldn't call it particularly impressive.

As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but interesting that it could properly interpret "dog leg" and ID them. Also, the dogs with many legs (I have a few of them) all had their extra legs added by nano-banana.

>>Workac+cU
This is exactly why I believe LLMs are a technological dead end. Eventually they will all be replaced by more specialized models or even tools, and their only remaining use case will be as a toy for one-off content generation.

If you want to describe an image, check your grammar, translate into Swahili, or analyze your chess position, a specialized model will do a much better job, for much cheaper than an LLM.

>>runarb+ec1
I think we are too quick to discount the possibility that this flaw is slightly intentional, in the sense that the optimization has a tight budget to work with (equivalent of ~3000 tokens), so why would it waste capacity on this when it could improve capabilities around reading small text in obscured images? Sort of like how humans have all these rules of thumb that backfire in all these ways, but that's the energy-efficient way to do things.

>>energy+7C1
Even so, that doesn’t take away from my point. Traditional specialized models can do these things already, for much cheaper and without expensive optimization. What traditional models cannot do is the toy aspect of LLMs, and that is the only use case I see for this technology going forward.

Let's say you are right and these things will be optimized, and in, say, 5 years, most models from the big players will be able to do things like read small text in an obscured image, draw a picture of a glass of wine filled to the brim, draw a path through a maze, count the legs of a 5-footed dog, etc. And say that by then they have used up the last of their venture capital subsidies (so customers pay the actual cost of these). Why would people use LLMs for these when a traditional specialized model can do it for much cheaper?

>>runarb+EE1
Having one tool that you can use to do all of these things makes a big difference. If I'm a financial analyst at a company, I don't need to know how to implement and use 5 different specialized ML models; I can just ask one tool (that can still use tools on the backend to complete the task efficiently).
