zlacker

Advancing AI Benchmarking with Game Arena

submitted by salkah+(OP) on 2026-02-02 17:49:07 | 134 points 54 comments
[view article] [source] [go to bottom]

NOTE: showing posts with links only
5. ofirpr+o7[view] [source] 2026-02-02 18:23:51
>>salkah+(OP)
This is a good way to benchmark models. We [the SWE-bench team] took the meta-version of this and implemented it as a new benchmark called CodeClash:

We have agents implement agents that play games against each other, so Claude isn't playing against GPT; instead, an agent written by Claude plays poker against an agent written by GPT. This really tough task leads to very interesting findings on AI for coding.

https://codeclash.ai/
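(CodeClash's actual harness isn't shown in the comment, but the agent-writes-agent setup boils down to pitting two submitted strategy functions against each other and tallying results. A minimal sketch of that idea, using rock-paper-scissors instead of poker and two hypothetical hand-written bots:)

```python
import random

# Minimal sketch of an agent-vs-agent harness (hypothetical, not
# CodeClash's real API). Each "agent" is a function that receives the
# match history and returns a move; the harness plays them against
# each other for many rounds and tallies wins.

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def bot_a(history):
    # A deliberately weak baseline: always plays rock.
    return "rock"

def bot_b(history):
    # Counters the opponent's previous move; opens at random.
    if not history:
        return random.choice(MOVES)
    last_opponent_move = history[-1][0]  # history holds (a_move, b_move) pairs
    counter = {"rock": "paper", "paper": "scissors", "scissors": "rock"}
    return counter[last_opponent_move]

def run_match(a, b, rounds=1000):
    history, score = [], {"a": 0, "b": 0, "draw": 0}
    for _ in range(rounds):
        ma, mb = a(history), b(history)
        if ma == mb:
            score["draw"] += 1
        elif BEATS[ma] == mb:
            score["a"] += 1
        else:
            score["b"] += 1
        history.append((ma, mb))
    return score

score = run_match(bot_a, bot_b)
```

Since bot_a never deviates, bot_b locks onto the counter-move after round one and wins essentially every round, which is the kind of exploit an agent-authored strategy would be rewarded for finding.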

◧◩
17. bob102+yj[view] [source] [discussion] 2026-02-02 19:25:57
>>cv5005+k8
https://arxiv.org/abs/2507.03793
◧◩
36. tux3+nR[view] [source] [discussion] 2026-02-02 21:51:45
>>tiahur+J4
For reference for anyone who missed it, the 2021 NetHack challenge results: https://nethackchallenge.com/report.html

That was a whole half a decade ago, but back then deep learning AIs were defeated very badly by handcrafted scripts. Even the best bot in the neural net category was actually a symbolic script/neural net hybrid.

◧◩
37. RobRiv+ZR[view] [source] [discussion] 2026-02-02 21:54:25
>>ofirpr+o7
https://ai.meta.com/research/publications/gaia-a-benchmark-f...

?

42. kenfor+y61[view] [source] 2026-02-02 22:45:47
>>salkah+(OP)
Let's add NetHack to the mix!

https://kenforthewin.github.io/blog/posts/nethack-agent/

45. mohsen+6o1[view] [source] 2026-02-03 00:01:09
>>salkah+(OP)
Oh hey, I've been running Werewolf/Mafia games as benchmarks for a while now.

https://mafia-arena.com

Gemini is consistently winning against top models
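(The comment doesn't say how mafia-arena.com ranks models, but a common way to turn repeated head-to-head results into a leaderboard is an Elo rating. A minimal sketch of the standard update, with function names and the k-factor being my own choices:)

```python
def expected_score(r_a, r_b):
    # Elo model: probability that player A beats player B.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a, r_b, a_score, k=32):
    # One game's rating update; a_score is 1.0 (A won), 0.0 (A lost),
    # or 0.5 (draw). Returns the pair of new ratings.
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (a_score - e_a)
    r_b_new = r_b + k * ((1.0 - a_score) - (1.0 - e_a))
    return r_a_new, r_b_new

# Hypothetical example: two models start at 1500 and one wins a game.
ra, rb = update_elo(1500.0, 1500.0, 1.0)
```

With both players at 1500 the expected score is 0.5, so the winner gains 16 points and the loser drops 16; repeated over many games, a model that "consistently wins against top models" climbs the table.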

◧◩◪
47. marksi+OI1[view] [source] [discussion] 2026-02-03 02:13:43
>>Rivier+OE
The most popular form was solved in 2019: https://en.wikipedia.org/wiki/Pluribus_(poker_bot)
50. jjcm+822[view] [source] 2026-02-03 04:56:08
>>salkah+(OP)
This was effectively what OpenAI did in the very early days with Dota 2: https://en.wikipedia.org/wiki/OpenAI_Five

As someone who's been playing Dota for nearly 20 years now, it was fascinating to watch it play. Some of its decision-making didn't seem logical in the short term, but would often set up future plays, even though the bots' observation window was fairly small. Even more impressive was that the AI bots changed the meta for professional players, since tactics that arose out of their training ended up being more optimal.

I wish we'd gotten to the point where other AI bots were out there, but it's entirely understandable that you couldn't drive a complex game like Dota with LLMs, whereas you can with the ones Game Arena has selected.

[go to top]