+------------------------------+---------+--------------+
| Benchmark | o3 | Gemini 2.5 |
| | | Pro |
+------------------------------+---------+--------------+
| ARC-AGI (High Compute) | 87.5% | — |
| GPQA Diamond (Science) | 87.7% | 84.0% |
| AIME 2024 (Math) | 96.7% | 92.0% |
| SWE-bench Verified (Coding) | 71.7% | 63.8% |
| Codeforces Elo Rating | 2727 | — |
| MMMU (Visual Reasoning) | 82.9% | 81.7% |
| MathVista (Visual Math) | 86.8% | — |
| Humanity’s Last Exam | 26.6% | 18.8% |
+------------------------------+---------+--------------+
If you're using these models to generate code daily, the costs add up.
Sure, I'll hand a really tough problem to o3 (and probably through ChatGPT, not the API), but on general code tasks there isn't a meaningful enough difference to justify 4x the cost.
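
To make the "costs add up" point concrete, here's a back-of-the-envelope sketch. The per-million-token prices and daily token volumes below are placeholder assumptions for illustration (rates change often; check your provider's pricing page), picked to reflect a roughly 4x gap:

    # Rough monthly cost comparison for daily code generation.
    # All prices are ILLUSTRATIVE PLACEHOLDERS, not current list prices.
    PRICE_PER_M_TOKENS = {            # (input, output) USD per 1M tokens -- assumed
        "o3":             (10.00, 40.00),
        "gemini-2.5-pro": ( 2.50, 10.00),
    }

    def monthly_cost(model: str, in_tok_per_day: int, out_tok_per_day: int,
                     days: int = 22) -> float:
        """Estimate a month of usage (default: 22 working days)."""
        p_in, p_out = PRICE_PER_M_TOKENS[model]
        daily = (in_tok_per_day / 1e6) * p_in + (out_tok_per_day / 1e6) * p_out
        return daily * days

    # Assume a heavy coding day: ~500k tokens of context in, ~100k generated.
    for model in PRICE_PER_M_TOKENS:
        print(f"{model}: ${monthly_cost(model, 500_000, 100_000):,.2f}/month")

Under these assumed rates that's about $198/month for o3 versus $49.50/month for Gemini 2.5 Pro, and the gap scales linearly with usage: a few extra benchmark points has to buy you a lot to be worth it at volume.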