zlacker

1. djoldm+(OP)[view] [source] 2025-12-05 19:18:33
Interesting "ScreenSpot Pro" results:

    72.7% Gemini 3 Pro
    11.4% Gemini 2.5 Pro
    49.9% Claude Opus 4.5
    3.50% GPT-5.1
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

https://arxiv.org/abs/2504.07981

replies(3): >>agenti+g2 >>jasonj+N9 >>simonw+Ra
2. agenti+g2[view] [source] 2025-12-05 19:29:17
>>djoldm+(OP)
Impressive... most impressive.

It's going to reach the low 90s very soon if trends continue.

3. jasonj+N9[view] [source] 2025-12-05 20:07:51
>>djoldm+(OP)
That is... astronomically different. Is GPT-5.1 downscaling and losing critical information or something? How could it be so different?
replies(3): >>ericd+Rl >>zubiau+JH >>energy+DJ
4. simonw+Ra[view] [source] 2025-12-05 20:12:34
>>djoldm+(OP)
I was surprised at how poorly GPT-5 did in comparison to Opus 4.1 and Gemini 2.5 on a pretty simple OCR task a few months ago - I should run that again against the latest models and see how they do. https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-cod...
replies(1): >>daemon+fP
◧◩
5. ericd+Rl[view] [source] [discussion] 2025-12-05 21:07:34
>>jasonj+N9
I got much better results on GPT with smallish UI elements in large screenshots by slicing the screenshots up manually and feeding in the slices one at a time. I think it does severely lossy downscaling.
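A minimal sketch of that slicing approach, for illustration only. The function name, the 1024-pixel tile size, and the 128-pixel overlap are assumptions, not anything the commenter specified; the overlap is there so that a UI element sitting on a tile boundary appears whole in at least one slice.

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (left, top, right, bottom) crop boxes covering a width x height image.

    tile and overlap are illustrative defaults, not values from any vendor docs.
    """
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    step = tile - overlap

    def starts(length):
        # Offsets along one axis; the final tile is shifted back to stay in bounds.
        if length <= tile:
            return [0]
        offsets = list(range(0, length - tile, step))
        offsets.append(length - tile)
        return offsets

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


# Example: crop boxes for a 4K screenshot.
boxes = tile_boxes(3840, 2160)
```

Each box can be passed to an image library's crop call (for example Pillow's `Image.crop(box)`). Any coordinates the model returns for a slice need the slice's `left` and `top` added back to map them onto the full screenshot.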
◧◩
6. zubiau+JH[view] [source] [discussion] 2025-12-05 23:17:33
>>jasonj+N9
It has a rather poor max resolution. Higher-resolution images get tiled, but only up to a point: I think 512 x 512 is the max tile size and 2048 x 2048 the max canvas.
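To make those numbers concrete, here is a rough sketch of the arithmetic, assuming the limits quoted above (512 x 512 tiles, 2048 x 2048 canvas) are right; the commenter is unsure of them. The function name and the fit-inside-the-canvas rule are assumptions for illustration, not a documented pipeline.

```python
import math


def fit_and_count_tiles(width, height, max_canvas=2048, tile=512):
    """Scale an image to fit inside max_canvas x max_canvas, then count tiles.

    max_canvas and tile are the unverified figures from the comment above.
    """
    # Never upscale; shrink just enough that both sides fit the canvas.
    scale = min(1.0, max_canvas / width, max_canvas / height)
    w = max(1, round(width * scale))
    h = max(1, round(height * scale))
    tiles = math.ceil(w / tile) * math.ceil(h / tile)
    return (w, h), tiles, scale


# Example: a 4K screenshot.
size, tiles, scale = fit_and_count_tiles(3840, 2160)
```

Under these assumptions a 3840 x 2160 screenshot is scaled by a little over one half before tiling, so a 16-pixel icon ends up roughly 8 to 9 pixels wide.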
◧◩
7. energy+DJ[view] [source] [discussion] 2025-12-05 23:28:22
>>jasonj+N9
This is my default explanation for visual impairments in LLMs: they're trying to compress the image into about 3000 tokens, so you're going to lose a lot in the name of efficiency.
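A back-of-the-envelope version of that argument, assuming the roughly 3000-token figure above and, as a simplification, that the budget is spread evenly over the image. The function name is hypothetical.

```python
import math


def pixels_per_token(width, height, token_budget=3000):
    """Return (pixels per token, side of the equivalent square patch).

    token_budget is the commenter's estimate, not a documented limit.
    """
    per_token = width * height / token_budget
    return per_token, math.sqrt(per_token)


# Example: a 4K screenshot.
per_token, side = pixels_per_token(3840, 2160)
```

For a 3840 x 2160 screenshot that works out to about 2,765 pixels per token, a patch of roughly 52 x 52 pixels, which is larger than many toolbar icons.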
◧◩
8. daemon+fP[view] [source] [discussion] 2025-12-06 00:16:35
>>simonw+Ra
Agreed, GPT-5 and even 5.1 are noticeably bad at OCR. OCRArena backs this up: https://www.ocrarena.ai/leaderboard (personally, I would rank 5.1 even lower than it appears there).

According to the calculator on the pricing page (it's inside a toggle at the bottom of the FAQs), GPT-5 resizes images so that the shorter side is at most 768: https://openai.com/api/pricing/ That's ~half the resolution I would normally use for OCR, so if that's happening even via the API then I guess it makes sense that it performs so poorly.
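As an illustration of what that cap means in practice, here is a small sketch that applies a 768-pixel shorter-side limit while keeping the aspect ratio. The function name and the rounding rule are assumptions; only the 768 figure comes from the pricing calculator mentioned above.

```python
def resize_short_side(width, height, max_short=768):
    """Shrink (width, height) so the shorter side is at most max_short.

    Aspect ratio is preserved; images already within the limit are unchanged.
    """
    short = min(width, height)
    if short <= max_short:
        return width, height
    scale = max_short / short
    return round(width * scale), round(height * scale)


# Example: a letter-size page scanned at 300 dpi (2550 x 3300 pixels).
page = resize_short_side(2550, 3300)
```

Under this rule a 300 dpi page scan keeps less than a third of its linear resolution, which would be consistent with small print becoming unreadable.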

replies(1): >>datadr+FA2
◧◩◪
9. datadr+FA2[view] [source] [discussion] 2025-12-06 19:27:07
>>daemon+fP
And GPT-4 was pretty decent at OCR, so that's weird?