ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
72.7% Gemini 3 Pro
11.4% Gemini 2.5 Pro
49.9% Claude Opus 4.5
3.50% GPT-5.1
Gemini 3 Pro has been making steady progress (12/16 badges) while Gemini 2.5 Pro is stuck (3/16 badges) despite using double the turns and tokens.
Source video title: Zelda: Breath of the Wild - Opening five minutes of gameplay
https://www.youtube.com/watch?v=xbt7ZYdUXn8
Prompt:
Please describe what is happening in each scene of this video.
List scenes with timestamps, then describe each separately:
- Setup and background, colors
- What is moving, what appears
- What objects are in this scene and what is happening
Basically, make a description of the 5-minute video for a person who can't watch it.
The result is on a GitHub gist since there's too much text: https://gist.github.com/ArseniyShestakov/43fe8b8c1dca45eadab...
I'd say this is quite accurate.
Here's the output from two tests I ran:
1. Asking Nano Banana Pro to solve the word search puzzle directly [1].
2. Asking Nano Banana Pro to highlight each word on the grid, with the position of every word included as part of the prompt [2].
The fact that it gets 2 words correct demonstrates meaningful progress; it seems like we're close to having a model that can one-shot this problem.
There's actually a bit of nuance required to solve this puzzle correctly which an older Gemini model struggled to do without additional nudging. You have to convert the grid or word list to use matching casing (the grid uses uppercase, the word list uses lowercase), and you need to recognize that "soup mix" needs to have the space removed when doing the search.
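For what it's worth, that normalization step is trivial to express in code. A minimal sketch (the word list below is a hypothetical stand-in for the puzzle's actual list):

```python
# Normalize the word list to match the uppercase grid before searching.
words = ["soup mix", "noodles", "carrots"]  # hypothetical entries

normalized = [w.upper().replace(" ", "") for w in words]
print(normalized)  # ['SOUPMIX', 'NOODLES', 'CARROTS']
```

A human solver does this conversion without noticing; the models apparently need the nudge.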
Here’s how Nano Banana fared: https://x.com/danielvaughn/status/1971640520176029704?s=46
https://gist.github.com/ArseniyShestakov/47123ce2b6b19a8e6b3...
Only one model (gpt-image-1) out of the 18 tested managed to pass the test successfully. Gemini 3.0 Pro got VERY close.
I passed the AIs this image and asked them how many fingers were on the hands: https://media.post.rvohealth.io/wp-content/uploads/sites/3/2...
Claude said there were 3 hands and 16 fingers. GPT said there are 10 fingers. Grok impressively said "There are 9 fingers visible on these two hands (the left hand is missing the tip of its ring finger)." Gemini smashed it and said 12.
https://gemini.google.com/share/e7a8b902ff67
But it didn't actually think it had done so, aligning with your observations. The last bits of its thinking were pretty sad (for it):
Analyzing Missing Anomaly
I'm focusing on the discrepancy - the persistent absence of the fifth leg in the generated images. Despite multiple attempts, the output consistently depicts a four-legged dog, contrary to the explicit requirement. This ongoing failure necessitates a shift in approach.
Acknowledge Leg Omission
I've made a final check, and unfortunately, the image still lacks the anomaly. The consistent absence of the fifth leg necessitates admitting my limitation. I'll provide the best image I could generate, clearly stating the missing feature.
```
Create a devenv project that does the following:
- Read the image at maze.jpg
- Write a script that solves the maze in the most optimal way between the mouse and the cheese
- Generate a new image which is of the original maze, but with a red line that represents the calculated path
Use whatever lib/framework is most appropriate
```
Output: https://gist.github.com/J-Swift/ceb1db348f46ba167948f734ff0fc604
Solution: https://imgur.com/a/bkJloPT
Place sneakers on all of its legs. It'll get this correct a surprising number of times (tested with BFL Flux2 Pro and NB Pro).
Video: Zelda TOTK, R5 5600X, GTX 1650, 1080p 10 Minute Gameplay, No Commentary
https://www.youtube.com/watch?v=wZGmgV-8Rbo
The narrative description source and command can be found here:
https://gist.github.com/ArseniyShestakov/47123ce2b6b19a8e6b3...
Then I converted it into a narrative voice-over with Gemini 2.5 Pro TTS:
https://drive.google.com/file/d/1Js2nDtM7sx14I43UY2PEoV5PuLM...
It's somewhat desynced from the original video, and the voice-over takes nine and a half minutes instead of the video's ten, but the description of what's happening on screen is quite accurate.
PS: I used the 144p video, so details could also be messed up because of the poor quality. And of course I specifically asked for a narrative-like description.
According to the calculator on the pricing page (it's inside a toggle at the bottom of the FAQs), GPT-5 is resizing images to have a minor dimension of at most 768: https://openai.com/api/pricing/ That's ~half the resolution I would normally use for OCR, so if that's happening even via the API then I guess it makes sense it performs so poorly.
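If that reading of the pricing calculator is right, the effective input resolution is easy to estimate. A minimal sketch, assuming the shorter side is capped at 768 px with aspect ratio preserved (the cap and function name are my assumptions, not a documented API):

```python
def effective_size(width: int, height: int, cap: int = 768) -> tuple[int, int]:
    """Scale so the shorter dimension is at most `cap`, preserving aspect ratio."""
    short = min(width, height)
    if short <= cap:
        return width, height  # already within the cap, no downscaling
    scale = cap / short
    return round(width * scale), round(height * scale)

# A 3024x4032 phone photo comes out at 768x1024 -- half the linear
# resolution you'd normally feed an OCR pipeline.
print(effective_size(3024, 4032))  # (768, 1024)
```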
https://gemini.google.com/share/b3b68deaa6e6
I thought giving it a setting would help, but just skip that first response to see what I mean.
https://chatgpt.com/share/6933c848-a254-8010-adb5-8f736bdc70...
This is the SVG it created.
Gemini responds:
Conceptualizing the "Millipup"
https://gemini.google.com/share/b6b8c11bd32f
Draw the five legs of a dog as if the body is a pentagon
https://gemini.google.com/share/d74d9f5b4fa4
And animal legs are quite standardized
https://en.wikipedia.org/wiki/List_of_animals_by_number_of_l...
It's all about the prompt. Example:
Can you imagine a dog with five legs?
https://gemini.google.com/share/2dab67661d0e
And generally, the issue sits between the computer and the chair.
;-)
> Gemini models are trained on a dataset that is both multimodal and multilingual. Our pre-training dataset uses data from web documents, books, and code, and includes image, audio, and video data.
I wonder if “How many legs do you see?” is close enough to “How many lights do you see?” that the LLMs are responding based on the memes surrounding the Star Trek episode “Chain of Command”.
Represent the maze as a sequence of movements that either continue or are forced to backtrack.
Basically, it would represent the maze as a graph and do a depth-first search, keeping track of which nodes it has visited in its reasoning tokens.
See for example https://stackoverflow.com/questions/3097556/programming-theo... where the solution is represented as:
A B D (backtrack) E H L (backtrack) M * (backtrack) O (backtrack thrice) I (backtrack thrice) C F (backtrack) G J
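A minimal sketch of that traversal, assuming the maze has already been reduced to an adjacency list (the graph below is a hypothetical fragment, not the one from the Stack Overflow answer):

```python
def dfs_trace(graph, start, goal):
    """Depth-first search that records moves and explicit backtracks,
    mirroring the 'A B D (backtrack) E ...' trace format above."""
    trace, visited = [], set()

    def visit(node):
        visited.add(node)
        trace.append(node)
        if node == goal:
            return True
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                if visit(neighbor):
                    return True
                trace.append("(backtrack)")  # dead end: record the retreat
        return False

    visit(start)
    return trace

graph = {"A": ["B", "C"], "B": ["D", "E"], "E": ["H"], "C": ["F", "G"]}
print(" ".join(dfs_trace(graph, "A", "G")))
# A B D (backtrack) E H (backtrack) (backtrack) (backtrack) C F (backtrack) G
```

The idea is that a model could emit exactly this kind of trace in its reasoning tokens, using the text itself as the visited-set memory.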
Asymmetry is as hard for AI models to "prompt for" as it is for evolution to produce, but they're getting better at it.
https://imagazine.pl/wp-content/uploads/2024/12/ugly-AI-pic-...
GPT-5 couldn't do it.
https://arxiv.org/abs/2407.01392
Of course it doesn't redraw the image on every step, so it's not exactly what you're suggesting (interesting idea, btw), but I think it's relevant.
https://chat.vlm.run/c/62394973-a869-4a54-a7f5-5f3bb717df5f
Here is the thought-process summary (you can see the full thinking at the link above):
"I have attempted to generate a dog with 5 legs multiple times, verifying each result. Current image generation models have a strong bias towards standard anatomy (4 legs for dogs), making it difficult to consistently produce a specific number of extra limbs despite explicit prompts."
If you can't tell, I take issue when terms are taken from psychology and applied to statistics. The terminology should flow in the other direction, from statistics into psychology.
For background, I did undergraduate work in both psychology and statistics (though I dropped out of statistics after two years), and this is the first time I've heard of artificial cognition, so I don't think the term is popular; a short internet search seems to confirm that suspicion.
Out of context, I would guess that artificial cognition relates to cognition the way artificial neural networks relate to neural networks; that is, models that simulate the mechanisms of human cognition and recreate some stimulus → response loop. However, my internet search revealed (thankfully) that this is not how researchers are using this (IMO misguided) term.
https://psycnet.apa.org/record/2020-84784-001
https://arxiv.org/abs/1706.08606
What the researchers mean by the term (at least the ones I found in my short internet search) is not actual machine cognition, nor claims that machines have cognition, but rather an approach of research which takes experimental designs from cognitive psychology and applies them to learning models.
That's great, but it's demonstrably false.
I can write code that calculates the average letter frequency across any Wikipedia article. I can't do that in my head without tools because of the rule of seven[1].
Tool use is absolutely an intelligence amplifier but it isn't the same thing.
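For contrast, the tool version is only a few lines. A minimal sketch, assuming Python with `requests` available (the article title is an arbitrary example):

```python
import requests
from collections import Counter

# Fetch the plain-text extract of an arbitrary article via the Wikipedia API.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "prop": "extracts", "explaintext": 1,
            "format": "json", "titles": "Statistics"},
)
text = next(iter(resp.json()["query"]["pages"].values()))["extract"]

# Count letters and report relative frequencies.
letters = [c.lower() for c in text if c.isalpha()]
freq = Counter(letters)
for letter, count in freq.most_common(5):
    print(letter, round(count / len(letters), 4))
```

Trivial with code; impossible to do in my head.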
> Because again, the actual “model” is just a text autocomplete engine and it generates from left to right.
This is technically true, but somewhat misleading. Humans speak "left to right" too. Specifically, LLMs do have some spatial reasoning ability (which is what you'd expect with RL training: otherwise they'd just predict the most popular token): https://snorkel.ai/blog/introducing-snorkelspatial/
[1] https://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus...