Advancing AI Benchmarking with Game Arena

>>salkah+(OP)
If AI can program, why does it matter if it can play Chess using CoT when it can program a Chess Engine instead? This applies to other domains as well.

>>10xDev+td
It's the same reason we are asked to write exams without calculators even though the real world does have them.

How you work without calculators is a proxy for real world competency.

>>simian+tj
Funny, you picked probably the most useless form of benchmarking applied to people as an example of measuring "competency" in the real world.

>>10xDev+Gk
Are you in favour of children using calculators in exams?

>>simian+Uk
It is a program. I need it to get task X done and I don't care how, whether it is strictly through CoT or with tools. There is no such thing as cheating in real work and no reason to handicap it. Just test the limits of what it can do with whatever means possible.

Trying to solve everything with CoT alone without utilising tools seems futile.

>>10xDev+Dl
You are not understanding. It's a proxy for how well it does other things.

>>simian+jp
A good proxy is knowing which tools to use to solve the problem, not trying to emulate how a human would play chess. That is pointless...

>>10xDev+XS
According to you, it says nothing about a person if they are good at chess.
