
A few observations:

1) The performance-vs-complexity curve looks very similar to that of most humans (I've seen groups attempt the Tower of Hanoi with 5 car tires), haha.

2) Models can trivially solve some of these tasks when given tools.

3) This is an internship paper with some quirks that led many people to mostly dismiss it, yet it's being quoted everywhere as "Apple proves LLMs can't ever reason".
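To illustrate the second point for Tower of Hanoi specifically: once a model can run code, the whole puzzle reduces to a few lines of recursion. The optimal solution for n disks (or tires) always takes 2^n - 1 moves, so 5 tires need 31. This is a minimal sketch; the function name `hanoi` and the peg labels are my own choices, not anything from the paper.

```python
def hanoi(n, src="A", aux="B", dst="C"):
    """Return the optimal list of (from_peg, to_peg) moves for n disks."""
    if n == 0:
        return []
    # Park the top n-1 disks on the spare peg, move the largest disk,
    # then stack the n-1 disks back on top of it.
    return (hanoi(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi(n - 1, aux, src, dst))


moves = hanoi(5)
print(len(moves))  # 31, i.e. 2**5 - 1
```

The move count doubles (plus one) with each added disk, which is why long move sequences blow up so quickly when written out step by step instead of computed.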

Anyway, it's a fun experiment for testing your understanding of these things, but don't take any of its conclusions as gospel :)