The Illusion of Thinking: Strengths and limitations of reasoning models [pdf]

>>amrrs+(OP)
> Rather than standard benchmarks (e.g., math problems), we adopt controllable puzzle environments that let us vary complexity systematically

Very clever, I must say. Kudos to folks who made this particular choice.

> we identify three performance regimes: (1) low complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.

This is fascinating! We need more "mapping" of regimes like this!

What I would love to see (not sure if someone on here has seen anything to this effect) is how these complexity regimes might map to economic value of the task.

For that, the eval needs to go beyond puzzles but the complexity of the tasks still need to be controllable.

>>curiou+eK
Using puzzles is not special or anything, it has been done a million times since before (and including) the LSTM paper (1997) https://www.bioinf.jku.at/publications/older/2604.pdf

>>make3+Uy4
The Arc Prize just released a new update and it's all minigame puzzles

https://arcprize.org/

zlacker