My read of this is that the paper demonstrates that given a particular model (and the problems examined with it) that giving more thought tokens does not help on problems above a certain complexity. It does not say anything about the capabilities of future, larger, models to handle more complex tasks. (NB: humans trend similarly)
My concern is that people are extrapolating from this to conclusions about LLM's generally, and this is not warranted
The only part about this i find even surprising is he abstract's conclusion (1): that 'thinking' can lead to worse outcomes for certain simple problem. (again though, maybe you can say humans are the same here. You can overthink things)
That is not a model-specific claim, it's a claim on the nature of LLMs.
For your argument to be true would need to mean that there is a qualitative difference, in which some models possess "true reasoning" capability and some don't, and this test only happened to look at the latter.
Furthermore we have clearly seen increases in reasoning from previous frontier models to current frontier models.
If the authors could /did show that both previous-generation and current-generation frontier models hit a wall at similar complexity that would be something, AFAIK they do not.