I've never seen this question quantified in a really compelling way, and while it's interesting, I'm not sure this PDF succeeds, at least not well enough to silence dissent. I think AI maximalists will continue to think that the models are in fact getting less dim-witted, while the AI skeptics will continue to think these apparent gains are in fact entirely a byproduct of "increasing" "omniscience." The razor will have to be a lot sharper before people start moving between these groups.
But, anyway, it's still an important question to ask, because omniscient-yet-dim-witted models terminate at "superhumanly assistive" rather than "Artificial Superintelligence", which in turn economically means "another bite at the SaaS apple" instead of "phase shift in the economy." So I hope the authors will eventually succeed.
We keep assigning adjectives to this technology that anthropomorphize the neat tricks we've invented. There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.
All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.
This is a neat trick, but it doesn't solve the underlying problems that plague these models like hallucination. If the "reasoning" process contains garbage, gets stuck in loops, etc., the final answer will also be garbage. I've seen sessions where the model approximates the correct answer in the first "reasoning" step, but then sabotages it with senseless "But wait!" follow-up steps. The final answer ends up being a mangled mess of all the garbage it generated in the "reasoning" phase.
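To make the "generated data as context" description concrete, here is a minimal sketch. The `generate` callable is a hypothetical stand-in for a model call, not any real library's API, and splitting the process into two calls is a simplification: actual models typically emit the reasoning and the answer in one continuous pass.

```python
from typing import Callable

REASONING_CUE = "\nThink step by step:\n"  # hypothetical prompt wording
ANSWER_CUE = "\nFinal answer:\n"           # hypothetical prompt wording


def answer_with_reasoning(prompt: str, generate: Callable[[str], str]) -> str:
    """Two-phase generation: intermediate text first, then the answer.

    `generate` is an assumed text-in, text-out model call.
    """
    # Phase 1: the model produces intermediate "reasoning" text.
    reasoning = generate(prompt + REASONING_CUE)
    # Phase 2: that text becomes part of the context for the final answer,
    # so whatever it contains (good or garbage) conditions the result.
    context = prompt + REASONING_CUE + reasoning + ANSWER_CUE
    return generate(context)


def fake_generate(context: str) -> str:
    """Stub model so the sketch runs without any real LLM."""
    if context.endswith(ANSWER_CUE):
        return f"answer conditioned on {len(context)} chars of context"
    return "step 1 ... But wait! ... step 2 ..."


print(answer_with_reasoning("How many legs do 3 goats have?", fake_generate))
```

The only point of the sketch is that the final call sees nothing except the prompt plus whatever phase 1 produced, which is why a bad intermediate step propagates.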
The only reason we keep anthropomorphizing these tools is that it makes us feel good. It's wishful thinking that markets well, gets investors buzzing, and grows the hype further. In reality, we're as close to artificial intelligence as we were a decade ago. What we do have are very good pattern matchers and probabilistic data generators that can leverage the enormous amount of compute we can throw at the problem. Which isn't to say that this can't be very useful, but ascribing human qualities to it only muddies the discussion.
I disagree: that seems like quite a good way of describing them. All language is a bit inexact.
Also, I don't buy that we are no closer to AI than ten years ago. There seems to be a lot going on. Just because LLMs are limited doesn't mean we can't find or add other algorithms. I mean, look at AlphaEvolve, for example: https://www.technologyreview.com/2025/05/14/1116438/google-d...
>found a faster way to solve matrix multiplications—a fundamental problem in computer science—beating a record that had stood for more than 50 years
I figure it's hard to argue that that is not at least somewhat intelligent?
The fact that this technology can be very useful doesn't imply that it's intelligent. My argument is about the language used to describe it, not about its abilities.
The breakthroughs we've had are because there is a lot of utility in finding patterns in data, which humans aren't very good at. Many of our problems can be boiled down to this task. So when we have vast amounts of data and compute at our disposal, we can be easily impressed by results that seem impossible for humans.
But this is not intelligence. The machine has no semantic understanding of what the data represents. The algorithm is optimized for generating specific permutations of tokens that match something it previously saw and was rewarded for. Again, very useful, but there's no thinking or reasoning there. The model doesn't have an understanding of why the wolf can't be close to the goat, or how a cabbage tastes. It's trained on enough data and algorithmic tricks that its responses can fool us into thinking it does, but this is just an illusion of intelligence. This is why we need to constantly feed it more tricks so that it doesn't fumble with basic questions like how many "R"s are in "strawberry", or that it doesn't generate racially diverse but historically inaccurate images.
How do you define "semantic understanding" in a way that doesn't ultimately boil down to saying they don't have phenomenal consciousness? Any functional concept of semantic understanding is captured to some degree by LLMs.
Typically when we attribute understanding to some entity, we recognize some substantial abilities in the entity in relation to that which is being understood. Specifically, the subject recognizes relevant entities and their relationships, various causal dependences, and so on. This ability goes beyond rote memorization; it has a counterfactual quality in that the subject can infer facts or descriptions in different but related cases beyond the subject's explicit knowledge. But LLMs excel at this.
>feed it more tricks so that it doesn't fumble with basic questions like how many "R"s are in "strawberry"
This failure mode has nothing to do with LLMs lacking intelligence and everything to do with how tokens are represented. They do not see individual characters, but sub-word chunks. It's like expecting a human to count the pixels in an image they see on a computer screen. While not impossible, it's unnatural to how we process images and therefore error-prone.
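A toy illustration of the sub-word point, under loud assumptions: the vocabulary below is made up, and greedy longest-match is a simplification of what real tokenizers (e.g. BPE-based ones) do. It only shows that the model's input is a short list of ids, not letters.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from data.
VOCAB = {
    "str": 0, "aw": 1, "berry": 2,
    # single-character fallbacks so any lowercase mix of these letters is covered
    "s": 3, "t": 4, "r": 5, "a": 6, "w": 7, "b": 8, "e": 9, "y": 10,
}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match segmentation of `text` into token ids."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece starting at position i first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


word = "strawberry"
ids = tokenize(word)
print(ids)              # what the model receives: three opaque ids
print(word.count("r"))  # the character-level fact, invisible at the id level
```

Counting the "r"s requires knowing how each id is spelled, which is information the ids themselves don't carry.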
LLMs are not consistent. This is unarguable. They will produce a string of text that says they have solved a problem and/or done a thing when neither is true.
And sometimes they will do it over and over, even when corrected.
Your last paragraph admits this.
Tokenisation on its own simply cannot represent reality accurately and reliably. It can be tweaked so that specific problems can appear solved, but true AI would be based on a reliable general strategy which solves entire classes of problems without needing this kind of tweaking.
It's clear we're nowhere close to that.
"True, LLMs can't do X, but a lot of people don't do X well either!"
The problem is, when you say humans have trouble with X, what you mean is that human brains are fully capable of X, but sometimes they do, indeed, make mistakes. Or that some humans haven't trained their faculties for X very well, or whatever.
But LLMs are fundamentally, completely, incapable of X. It is not something that can be a result of their processes.
These things are not comparable.
So, to your specific point: When an LLM is inconsistent, it is because it is, at its root, a statistical engine generating plausible next tokens, with no semantic understanding of the underlying data. When a human is inconsistent, it is because they got distracted, didn't learn enough about this particular subject, or otherwise made a mistake that they can, if their attention is drawn to it, recognize and correct.
LLMs cannot. They can only be told they made a mistake, which prompts them to try again (because that's the pattern that has been trained into them for what happens when told they made a mistake). But their next try won't have any better odds of being correct than their previous one.
This is the very point of contention. You don't get to just assume it.
> it is because it is, at its root, a statistical engine generating plausible next tokens, with no semantic understanding of the underlying data.
Another highly contentious point you are just outright assuming. LLMs are modelling the world, not just "predicting the next token". Some examples here[1][2][3]. Anyone claiming otherwise at this point is not arguing in good faith. It's interesting how the people with the strongest opinions about LLMs don't seem to understand them.
[1] https://arxiv.org/abs/2405.15943
[2] https://x.com/OwainEvans_UK/status/1894436637054214509
[3] https://www.anthropic.com/research/tracing-thoughts-language...