This feels a bit like a weird way to test 'thinking' in models, and reminds me of the old story of Gauss[1] and his classmates being assigned the task of adding up the numbers from 1-100.
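For anyone who hasn't seen the story, here is a quick sketch of the two ways to do that assignment (Python; the function names are mine, purely for illustration). One grinds through every addition, the other uses the arithmetic series formula n(n+1)/2:

```python
def sum_by_hand(n):
    """Add 1 + 2 + ... + n one term at a time (the classmates' approach)."""
    total = 0
    for k in range(1, n + 1):
        total += k
    return total


def sum_by_formula(n):
    """Closed form for the same sum: n(n+1)/2 (the approach credited to Gauss)."""
    return n * (n + 1) // 2


print(sum_by_hand(100), sum_by_formula(100))  # 5050 5050
```

Both give the same answer, but only the first one gets more error-prone as n grows.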

I think the way the paper lays out the performance regimes is pretty interesting, but I don't think they achieved their goal of demonstrating that LRMs can't use reasoning to solve complex puzzles organically (without contamination/memorization). IMO testing the model's ability to define an algorithm that solves the puzzle would have been a better evaluation of that than having the model walk through all of the steps manually. I also don't know that I'd use an LRM for this sort of long-tail reasoning, where it has to follow one single process for a long time within just one prompt; if I needed a really long chain of reasoning I'd use an agent or workflow.

It sounds more like the tests measure a model's ability to reason coherently and consistently over many steps rather than a model's ability to understand and solve a complex puzzle. For example, for the Tower of Hanoi, a prompt like "Define an algorithm that will find the sequence of moves to transform the initial configuration into the goal configuration" (i.e. "find an arithmetic series formula, young Gauss") seems like it would have been a better approach than "Find the sequence of moves to transform the initial configuration into the goal configuration" (i.e. "add up all these numbers"). You can kind of see this in how the study included a step where the LRMs were given the algorithm and then asked to solve the problem: the focus was on an LRM's ability to follow the steps, not its ability to come up with an algorithm/solution on its own.

In a job interview, for example, who among us would accept inability to hold all of the `(2^n) - 1` steps of the Tower of Hanoi in our brain as evidence of poor reasoning ability?
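To make the contrast concrete, here is a minimal sketch of the textbook recursive solution (Python; the name `hanoi`, the peg labels, and the move format are my own choices, not anything taken from the paper). The algorithm fits in a few lines, while the move list it produces doubles with every added disk:

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Yield (disk, from_peg, to_peg) moves that transfer n disks from source to target."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the spare peg.
    yield from hanoi(n - 1, source, spare, target)
    # Move the largest remaining disk straight to the target.
    yield (n, source, target)
    # Move the n-1 smaller disks from the spare peg onto the target.
    yield from hanoi(n - 1, spare, target, source)


for n in (3, 10, 15):
    count = sum(1 for _ in hanoi(n))
    assert count == 2**n - 1
    print(n, count)
```

Writing those lines down is the "define an algorithm" task; reciting every move it yields without a slip is the "add up all these numbers" task.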

Again, I think it's a really interesting study covering a model's ability to consistently follow a simple process over time in pursuit of a static objective (and perhaps a useful benchmark moving forward), but I'm not confident that it successfully demonstrates a meaningful deficiency in overall reasoning capability.

[1]: https://www.americanscientist.org/article/gausss-day-of-reck...