We have agents implement agents that play games against each other: Claude isn't playing against GPT; rather, an agent written by Claude plays poker against an agent written by GPT. This really tough task leads to very interesting findings on AI for coding.
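For concreteness, here's a toy sketch of the kind of artifact a model might submit in such a setup. The observation format, the action strings, and the act() interface are my own assumptions for illustration, not the arena's actual API:

```python
# Hypothetical sketch of "an agent written by Claude": the model's reasoning
# is frozen into code ahead of time, and only this code runs during the game.
import random

def act(observation: dict) -> str:
    """Return 'fold', 'call', or 'raise' given the current hand state.

    observation is assumed to contain our hole cards and the amount to call.
    """
    hole = observation["hole_cards"]   # e.g. ["Ah", "Kd"]
    to_call = observation["to_call"]   # chips needed to continue

    ranks = "23456789TJQKA"
    high_pair = (hole[0][0] == hole[1][0]
                 and ranks.index(hole[0][0]) >= ranks.index("T"))

    if high_pair:
        return "raise"
    if to_call == 0:
        return "call"                  # check when continuing is free
    # Mix in randomness so the strategy isn't trivially exploitable.
    return "call" if random.random() < 0.4 else "fold"

print(act({"hole_cards": ["Qh", "Qs"], "to_call": 50}))  # -> raise
```

The point of the format is that the game then tests the code the model wrote, not the model's in-context play.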
How you work without a calculator is a proxy for real-world competency.
Are you going to share those with the class or?
AI already has a very creative imagination for role play, so this just adds to its arsenal.
Trying to solve everything with CoT alone without utilising tools seems futile.
Heh, we really did come full circle on this! When ChatGPT launched in December 2022, one of the first things people noticed was that it sucked at math. Basic arithmetic like 12 + 35 would trip it up. Then people "discovered" tool use and added a calculator. And everyone was like "well, that's cheating, of course it can use a calculator, but look, it can't do the simple addition logic"... And now here we are :)
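For anyone who missed that era, the "discovered" tool use amounted to something like the following. The CALC: convention and the handler are made up for illustration, not any vendor's actual API:

```python
# Minimal sketch of the calculator-tool pattern: the model emits a structured
# tool request instead of doing arithmetic in-token, the harness executes it,
# and the result is fed back into the conversation.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a basic arithmetic expression without eval()/exec()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))

def handle(model_output: str) -> str:
    # If the model asked for the calculator, run it; otherwise pass through.
    if model_output.startswith("CALC:"):
        return str(safe_eval(model_output[len("CALC:"):].strip()))
    return model_output

print(handle("CALC: 12 + 35"))  # -> 47
```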
Bizarre.
And as a poker player, I can say that this game is much more challenging for computers than chess; writing a program that plays poker really well and efficiently is an unsolved problem.
Chess engines don’t grow on trees; they’re built by intelligent systems that can think, namely human brains.
Supposedly we want to build machines that can also think, not just regurgitate things created by human brains. That’s why testing CoT is important.
It’s not actually about chess, it’s about thinking and intelligence.
Maybe we should just retire tedious benchmarks like chess altogether at this point. They lead people to think about how to limit AI to keep the benchmark relevant, rather than building on what's already there.
That was a whole half-decade ago, but back then deep-learning AIs were beaten very badly by handcrafted scripts. Even the best bot in the neural-net category was actually a symbolic-script/neural-net hybrid.
It doesn't even need to be one tool; it can be a series of tools.
Ultimately I think it's impossible to define AGI. Maybe "I know it when I see it"—except everyone sees it at a different point (evidently).
Gemini is consistently winning against the other top models.
As someone who's been playing Dota for nearly 20 years now, it was fascinating to watch it play. Some of its decision-making didn't seem logical in the short term but would often set up future plays, even though its observation window was fairly small. Even more impressive was that the AI bot changed the meta for professional players, since tactics that arose from its training ended up being more optimal.
I wish we'd gotten to the point where other AI bots were out there, but it's entirely understandable that you couldn't drive a complex game like Dota with LLMs, whereas you can with the games the Game Arena has selected.
There were two villagers and one werewolf. The werewolf started the round by saying "I'm the werewolf, vote for me," and the game ended with a villager win.
Overnight he had successfully taken out the doctor. It made no sense, in my opinion.
There were some funny bits, like one of the Anthropic models forgetting a rule, which led to everyone accusing him of being a werewolf in a pile-on. He wasn't a werewolf; he genuinely forgot the rule. That happens in nearly every human game of Werewolf.