I've never seen this question quantified in a really compelling way, and while this PDF is interesting, I'm not sure it succeeds, at least not well enough to silence dissent. I think AI maximalists will continue to think that the models are in fact getting less dim-witted, while the AI skeptics will continue to think these apparent gains are in fact entirely a byproduct of "increasing" "omniscience." The razor will have to be a lot sharper before people start moving between these groups.
But, anyway, it's still an important question to ask, because omniscient-yet-dim-witted models terminate at "superhumanly assistive" rather than "Artificial Superintelligence", which in turn economically means "another bite at the SaaS apple" instead of "phase shift in the economy." So I hope the authors will eventually succeed.
We keep assigning adjectives to this technology that anthropomorphize the neat tricks we've invented. There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.
All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.
This is a neat trick, but it doesn't solve the underlying problems that plague these models, like hallucination. If the "reasoning" process contains garbage, gets stuck in loops, etc., the final answer will also be garbage. I've seen sessions where the model approximates the correct answer in the first "reasoning" step, but then sabotages it with senseless "But wait!" follow-up steps. The final answer ends up being a mangled mess of all the garbage it generated in the "reasoning" phase.
The only reason we keep anthropomorphizing these tools is that it makes us feel good. It's wishful thinking that markets well, gets investors buzzing, and grows the hype further. In reality, we're as close to artificial intelligence as we were a decade ago. What we do have are very good pattern matchers and probabilistic data generators that can leverage the enormous amount of compute we can throw at the problem. Which isn't to say that this can't be very useful, but ascribing human qualities to it only muddies the discussion.
Computers can't think and submarines can't swim.
So just like computers are better than humans at multiplying numbers, there are still many things we need human intelligence for, even in today's era of LLMs.
So if an LLM generates working code, correct translations, valid points relating to complex matters, and so on, it doesn't matter whether it does so by thinking or by some other mechanism.
I think that's an interesting point.
But the point is that the desired result isn't achieved; we still need humans to think.
So we still need a word for what humans do that is different from what an LLM does. If you are saying there is no difference, then how do you explain the vast difference in capability between humans and LLMs?
Submarines and swimming is a great metaphor for this, since submarines clearly don't swim and thus have very different abilities in water: they're way better in some ways but way worse in others. So using that metaphor, it's clear that LLM "thinking" cannot be described with the same words as human thinking, since it's so different.