zlacker

[return to "Gemini 3 Pro: the frontier of vision AI"]
1. Workac+cU 2025-12-05 20:26:05
>>xnx+(OP)
Well

It is the first model to get partial credit on an LLM image test I have: counting the legs of a dog. Specifically, a dog with 5 legs. It's a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.

In fact, GPT-5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug and adjusted the script's sensitivity so it only located 4, lol.
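For the curious, the idea behind such a script is simple. This is only a toy reconstruction of the concept, not the actual script GPT-5 wrote: the color values, tolerance, and synthetic image are all made up. It counts contiguous runs of "golden" pixels along one image row near the grass line, treating each run as a leg:

```python
import numpy as np

GOLD = np.array([200, 160, 60])   # assumed "golden fur" RGB
GREEN = np.array([60, 180, 60])   # assumed "bright grass" RGB

def count_legs(img, row, tol=40):
    """Count distinct runs of gold-ish pixels along one image row.

    Each contiguous run of pixels within `tol` of GOLD is treated
    as one leg crossing the grass line at that row.
    """
    line = img[row].astype(int)
    is_gold = np.all(np.abs(line - GOLD) < tol, axis=-1)
    # a run starts wherever is_gold flips False -> True
    starts = np.count_nonzero(is_gold[1:] & ~is_gold[:-1])
    return starts + int(is_gold[0])

# synthetic test image: green background with five golden "feet"
img = np.tile(GREEN, (10, 100, 1)).astype(np.uint8)
for x in [10, 25, 40, 55, 70]:
    img[:, x:x + 5] = GOLD

print(count_legs(img, row=5))  # → 5
```

Note that the `tol` threshold plays exactly the role described above: widen or narrow it and the leg count changes, which is presumably the "sensitivity" knob that got adjusted until the answer came out as 4.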

Anyway, Gemini 3, while still unable to count the legs on the first try, did identify "male anatomy" (its own words) also visible in the picture. The 5th leg was approximately where you could expect a well-endowed dog to have a "5th leg".

That aside though, I still wouldn't call it particularly impressive.

As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but interesting that it could properly interpret "dog leg" and ID them. Also, the dogs with many legs (I have a few of them) all had their extra legs added by nano-banana.

2. vunder+E41 2025-12-05 21:17:43
>>Workac+cU
Anything that needs to overcome concepts which are disproportionately represented in the training data is going to give these models a hard time.

Try generating:

- A spider missing one leg

- A 9-pointed star

- A 5-leaf clover

- A man with six fingers on his left hand and four fingers on his right

You'll be lucky to get a 25% success rate.

The last one is particularly ironic given how much work went into FIXING the old SD 1.5 issues with hand anatomy... to the point where I'm seriously considering incorporating it as a new test scenario on GenAI Showdown.

3. Xenoph+8o1 2025-12-05 23:12:51
>>vunder+E41
It mostly depends on how the models work. Multimodal, unified text/image sequence-to-sequence models can do this pretty well; diffusion models can't.
4. vunder+nL1 2025-12-06 02:42:33
>>Xenoph+8o1
Multimodal certainly helps, but "pretty well" is a stretch. I'd be curious to know which multimodal model in particular you've tried that could consistently handle generative prompts of the above nature (without human-in-the-loop corrections).

For example, to my knowledge ChatGPT is unified and I can guarantee it can't handle something like a 7-legged spider.

5. Xenoph+2u3 2025-12-06 21:22:23
>>vunder+nL1
I just got the model to generate a spider without a leg by saying "Spider missing one leg," and it did it fine. It won't do it every time (in my case 1 out of 2), but it will do it. I used the GPT-image-1 model in the API. I don't think they're actually running a full end-to-end text/image sequence-to-sequence model. I don't think anyone really is commercially; they're hybrids as far as I know. Someone here probably has better information on the current architectures.