ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
72.7% Gemini 3 Pro
11.4% Gemini 2.5 Pro
49.9% Claude Opus 4.5
3.50% GPT-5.1
Gemini 3 Pro has been making steady progress (12/16 badges) while Gemini 2.5 Pro is stuck (3/16 badges) despite using double the turns and tokens.
Source video title: Zelda: Breath of the Wild - Opening five minutes of gameplay
https://www.youtube.com/watch?v=xbt7ZYdUXn8
Prompt:
Please describe what is happening in each scene of this video.
List scenes with timestamps, then describe each separately:
- Setup and background, colors
- What is moving, what appears
- What objects are in this scene and what is happening
Basically, make a description of the 5-minute video for a person who can't watch it.
The result is on a GitHub gist since there's too much text: https://gist.github.com/ArseniyShestakov/43fe8b8c1dca45eadab...
I'd say this is quite accurate.
Here's the output from two tests I ran:
1. Asking Nano Banana Pro to solve the word search puzzle directly [1].
2. Asking Nano Banana Pro to highlight each word on the grid, with the position of every word included as part of the prompt [2].
The fact that it gets 2 words correct demonstrates meaningful progress; it seems like we're close to having a model that can one-shot this problem.
There's actually a bit of nuance required to solve this puzzle correctly which an older Gemini model struggled to do without additional nudging. You have to convert the grid or word list to use matching casing (the grid uses uppercase, the word list uses lowercase), and you need to recognize that "soup mix" needs to have the space removed when doing the search.
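For what it's worth, that normalization step is trivial to express in code. A minimal sketch (the word list below is a hypothetical stand-in for the puzzle's actual list):

```python
# Normalize the word list to match the uppercase grid before searching.
words = ["soup mix", "noodles", "carrots"]  # hypothetical entries

normalized = [w.upper().replace(" ", "") for w in words]
print(normalized)  # ['SOUPMIX', 'NOODLES', 'CARROTS']
```

A human solver does this conversion without noticing; the models apparently need the nudge.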
Here’s how Nano Banana fared: https://x.com/danielvaughn/status/1971640520176029704?s=46
https://gist.github.com/ArseniyShestakov/47123ce2b6b19a8e6b3...
Only one model (gpt-image-1) out of the 18 tested managed to pass the test successfully. Gemini 3.0 Pro got VERY close.
I passed the AIs this image and asked them how many fingers were on the hands: https://media.post.rvohealth.io/wp-content/uploads/sites/3/2...
Claude said there were 3 hands and 16 fingers. GPT said there are 10 fingers. Grok impressively said "There are 9 fingers visible on these two hands (the left hand is missing the tip of its ring finger)." Gemini smashed it and said 12.
https://gemini.google.com/share/e7a8b902ff67
But it didn't actually think it had done so, aligning with your observations. The last bits of its thinking were pretty sad (for it):
Analyzing Missing Anomaly
I'm focusing on the discrepancy - the persistent absence of the fifth leg in the generated images. Despite multiple attempts, the output consistently depicts a four-legged dog, contrary to the explicit requirement. This ongoing failure necessitates a shift in approach.
Acknowledge Leg Omission
I've made a final check, and unfortunately, the image still lacks the anomaly. The consistent absence of the fifth leg necessitates admitting my limitation. I'll provide the best image I could generate, clearly stating the missing feature.
```
Create a devenv project that does the following:
- Read the image at maze.jpg
- Write a script that solves the maze in the most optimal way between the mouse and the cheese
- Generate a new image which is of the original maze, but with a red line that represents the calculated path
Use whatever lib/framework is most appropriate
```
Output: https://gist.github.com/J-Swift/ceb1db348f46ba167948f734ff0fc604
Solution: https://imgur.com/a/bkJloPT
Place sneakers on all of its legs. It'll get this correct a surprising number of times (tested with BFL Flux2 Pro and NB Pro).
Video: Zelda TOTK, R5 5600X, GTX 1650, 1080p 10 Minute Gameplay, No Commentary
https://www.youtube.com/watch?v=wZGmgV-8Rbo
The narrative description source and command can be found here:
https://gist.github.com/ArseniyShestakov/47123ce2b6b19a8e6b3...
Then I converted it into a narrative voice-over with Gemini 2.5 Pro TTS:
https://drive.google.com/file/d/1Js2nDtM7sx14I43UY2PEoV5PuLM...
It's somewhat desynced from the original video, and the voice-over takes nine and a half minutes instead of the video's ten, but the description of what's happening on screen is quite accurate.
PS: I used the 144p video, so details could also be messed up because of the poor quality. And of course I specifically asked for a narrative-like description.
According to the calculator on the pricing page (it's inside a toggle at the bottom of the FAQs), GPT-5 is resizing images to have a minor dimension of at most 768: https://openai.com/api/pricing/ That's ~half the resolution I would normally use for OCR, so if that's happening even via the API then I guess it makes sense it performs so poorly.
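If that reading of the pricing calculator is right, the effective input resolution is easy to estimate. A minimal sketch, assuming the shorter side is capped at 768 px with aspect ratio preserved (the cap and function name are my assumptions, not a documented API):

```python
def effective_size(width: int, height: int, cap: int = 768) -> tuple[int, int]:
    """Scale so the shorter dimension is at most `cap`, preserving aspect ratio."""
    short = min(width, height)
    if short <= cap:
        return width, height  # already within the cap, no downscaling
    scale = cap / short
    return round(width * scale), round(height * scale)

# A 3024x4032 phone photo comes out at 768x1024 -- half the linear
# resolution you'd normally feed an OCR pipeline.
print(effective_size(3024, 4032))  # (768, 1024)
```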
https://gemini.google.com/share/b3b68deaa6e6
I thought giving it a setting would help, but just skip that first response to see what I mean.
https://chatgpt.com/share/6933c848-a254-8010-adb5-8f736bdc70...
This is the SVG it created.
Gemini responds:
Conceptualizing the "Millipup"
https://gemini.google.com/share/b6b8c11bd32f
Draw the five legs of a dog as if the body is a pentagon
https://gemini.google.com/share/d74d9f5b4fa4
And animal legs are quite standardized
https://en.wikipedia.org/wiki/List_of_animals_by_number_of_l...
It's all about the prompt. Example:
Can you imagine a dog with five legs?
https://gemini.google.com/share/2dab67661d0e
And generally, the issue sits between the computer and the chair.
;-)
> Gemini models are trained on a dataset that is both multimodal and multilingual. Our pre-training dataset uses data from web documents, books, and code, and includes image, audio, and video data.
I wonder if “How many legs do you see?” is close enough to “How many lights do you see?” that the LLMs are responding based on the memes surrounding the Star Trek episode “Chain of Command”.
Represent the maze as a sequence of movements that either continue or are forced to backtrack.
Basically, it would represent the maze as a graph and do a depth-first search, keeping track of which nodes it has visited in its reasoning tokens.
See for example https://stackoverflow.com/questions/3097556/programming-theo... where the solution is represented as:
A B D (backtrack) E H L (backtrack) M * (backtrack) O (backtrack thrice) I (backtrack thrice) C F (backtrack) G J
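A minimal sketch of that traversal, assuming the maze has already been reduced to an adjacency list (the graph below is a hypothetical fragment, not the one from the Stack Overflow answer):

```python
def dfs_trace(graph, start, goal):
    """Depth-first search that records moves and explicit backtracks,
    mirroring the 'A B D (backtrack) E ...' trace format above."""
    trace, visited = [], set()

    def visit(node):
        visited.add(node)
        trace.append(node)
        if node == goal:
            return True
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                if visit(neighbor):
                    return True
                trace.append("(backtrack)")  # dead end: record the retreat
        return False

    visit(start)
    return trace

graph = {"A": ["B", "C"], "B": ["D", "E"], "E": ["H"], "C": ["F", "G"]}
print(" ".join(dfs_trace(graph, "A", "G")))
# A B D (backtrack) E H (backtrack) (backtrack) (backtrack) C F (backtrack) G
```

The idea is that a model could emit exactly this kind of trace in its reasoning tokens, using the text itself as the visited-set memory.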
Asymmetry is as hard for AI models to "prompt for" as it is for evolution to produce, but they're getting better at it.
https://imagazine.pl/wp-content/uploads/2024/12/ugly-AI-pic-...
GPT-5 couldn't do it.
https://arxiv.org/abs/2407.01392
Of course it doesn't redraw the image on every step, so it's not exactly what you're suggesting (interesting idea, btw), but I think it's relevant.
https://chat.vlm.run/c/62394973-a869-4a54-a7f5-5f3bb717df5f
Here is the thought-process summary (you can see the full thinking at the link above):
"I have attempted to generate a dog with 5 legs multiple times, verifying each result. Current image generation models have a strong bias towards standard anatomy (4 legs for dogs), making it difficult to consistently produce a specific number of extra limbs despite explicit prompts."
If you can't tell, I take issue when terms are taken from psychology and applied to statistics. The terminology should flow in the other direction, from statistics into psychology.
For background, I did undergraduate work in both psychology and statistics (though I dropped out of statistics after two years), and this is the first time I've heard of artificial cognition, so I don't think the term is popular; a short internet search seems to confirm that suspicion.
Out of context, I would guess that artificial cognition relates to cognition the way artificial neural networks relate to neural networks; that is, models that simulate the mechanisms of human cognition and recreate some stimulus → response loop. However, my internet search revealed (thankfully) that this is not how researchers are using this (IMO misguided) term.
https://psycnet.apa.org/record/2020-84784-001
https://arxiv.org/abs/1706.08606
What the researchers mean by the term (at least the ones I found in my short internet search) is not actual machine cognition, nor claims that machines have cognition, but rather an approach of research which takes experimental designs from cognitive psychology and applies them to learning models.
That's great, but it's demonstrably false.
I can write code that calculates the average letter frequency across any Wikipedia article. I can't do that in my head without tools because of the rule of seven[1].
Tool use is absolutely an intelligence amplifier but it isn't the same thing.
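For contrast, the tool version is only a few lines. A minimal sketch, assuming Python with `requests` available (the article title is an arbitrary example):

```python
import requests
from collections import Counter

# Fetch the plain-text extract of an arbitrary article via the Wikipedia API.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "prop": "extracts", "explaintext": 1,
            "format": "json", "titles": "Statistics"},
)
text = next(iter(resp.json()["query"]["pages"].values()))["extract"]

# Count letters and report relative frequencies.
letters = [c.lower() for c in text if c.isalpha()]
freq = Counter(letters)
for letter, count in freq.most_common(5):
    print(letter, round(count / len(letters), 4))
```

Trivial with code; impossible to do in my head.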
> Because again, the actual “model” is just a text autocomplete engine and it generates from left to right.
This is technically true, but somewhat misleading. Humans speak "left to right" too. Specifically, LLMs do have some spatial reasoning ability (which is what you'd expect with RL training: otherwise they'd just predict the most popular token): https://snorkel.ai/blog/introducing-snorkelspatial/
[1] https://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus...