This year honestly feels quite stagnant. LLMs are literally technology that can only reproduce the past. They're cool, but they were way cooler 4 years ago. We've taken big ideas like "agents" and "reinforcement learning" and basically stripped them of all meaning in order to claim progress.
I mean, do you remember Geoffrey Hinton's RBM talk at Google in 2010? [0] That was absolutely insane for anyone keeping up with that field. By the mid-twenty teens RBMs were already outdated. I remember when everyone was implementing flavors of RNNs and LSTMs. Karpathy's character 2015 RNN project was insane [1].
This comment makes me wonder if part of the hype around LLMs is just that a lot of software people simply weren't paying attention to the absolutely mind-blowing progress we've seen in this field for the last 20 years. But even ignoring ML, the world's of web development and mobile application development have gone through incredible progress over the last decade and a half. I remember a time when JavaScript books would have a section warning that you should never use JS for anything critical to the application. Then there's the work in theorem provers over the last decade... If you remember when syntactic sugar was progress, either you remember way further back than I do, or you weren't paying attention to what was happening in the larger computing world.
Funny, I've used them to create my own personalized text editor, perfectly tailored to what I actually want. I'm pretty sure that didn't exist before.
It's wild to me how many people who talk about LLM apparently haven't learned how to use them for even very basic tasks like this! No wonder you think they're not that powerful, if you don't even know basic stuff like this. You really owe it to yourself to try them out.
I've worked at multiple AI startups in lead AI Engineering roles, both working on deploying user facing LLM products and working on the research end of LLMs. I've done collaborative projects and demos with a pretty wide range of big names in this space (but don't want to doxx myself too aggressively), have had my LLM work cited on HN multiple times, have LLM based github projects with hundreds of stars, appeared on a few podcasts talking about AI etc.
This gets to the point I was making. I'm starting to realize that part of the disconnect between my opinions on the state of the field and others is that many people haven't really been paying much attention.
I can see if recent LLMs are your first intro to the state of the field, it must feel incredible.
The change hit us so fast a huge number of people don’t understand how capable it is yet.
Also it certainly doesn’t help that it still hallucinates. One mistake and it’s enough to set someone against LLMs. You really need to push through that hallucinations are just the weak part of the process to see the value.
Either that, or they tried it "last year" or "a while back" and have no concept of how far things have gone in the meantime.
It's like they wandered into a machine shop, cut off a finger or two, and concluded that their grandpa's hammer and hacksaw were all anyone ever needed.
SWEs are trained to discard surface-level observations and be adversarial. You can't just look at the happy path, how does the system behave for edge cases? Where does it break down and how? What are the failure modes?
The actual analogy to a machine shop would be to look at whether the machines were adequate for their use case, the building had enough reliable power to run and if there were any safety issues.
It's easy to Clever Hans yourself and get snowed by what looks like sophisticated effort or flat out bullshit. I had to gently tell a junior engineer that just because the marketing claims something will work a certain way, that doesn't mean it will.
The key point you’re missing is the type of failure. Search systems fail by not retrieving. Parrots fail by repeating. LLMs fail by producing internally coherent but factually wrong world models. That failure mode only exists if the system is actually modeling and reasoning, imperfectly. You don’t get that behavior from lookup or regurgitation.
This shows up concretely in how errors scale. Ambiguity and multi-step inference increase hallucinations. Scaffolding, tools, and verification loops reduce them. Step-by-step reasoning helps. Grounding helps. None of that makes sense for a glorified Google search.
Hallucinations are a real weakness, but they’re not evidence of absence of capability. They’re evidence of an incomplete reasoning system operating without sufficient constraints. Engineers don’t dismiss CNC machines because they crash bits. They map the envelope and design around it. That’s what’s happening here.
Being skeptical of reliability in specific use cases is reasonable. Concluding from those failure modes that this is just Clever Hans is not adversarial engineering. It’s stopping one layer too early.
Absolutely not true. I cannot express how strongly this is not true, haha. The tech is neat, and plenty of real computer scientists work on it. That doesn't mean it's not wildly misunderstood by others.
> Concluding from those failure modes that this is just Clever Hans is not adversarial engineering.
I feel like you're maybe misunderstanding what I mean when I refer to Clever Hans. The Clever Hans story is not about the horse. It's about the people.
A lot of people -- including his owner-- were legitimately convinced that a horse could do math, because look, literally anyone can ask the horse questions and it answers them correctly. What more proof do you need? It's obvious he can do math.
Except of course it's not true lol. Horses are smart critters, but they absolutely cannot do arithmetic no matter how much you train them.
The relevant lesson here is it's very easy to convince yourself you saw something you 100% did not see. (It's why magic shows are fun.)
These things are not horses. How can anyone choose to remain so ignorant in the face of irrefutable evidence that they're wrong?
https://arxiv.org/abs/2507.15855
It's as if a disease like COVID swept through the population, and every human's IQ dropped 10 to 15 points while our machines grew smarter to an even larger degree.
The first thing I checked was "how did they verify the proofs were correct" and the answer was they got other AI people to check it, and those people said there were serious problems with the paper's methodology and it would not be a gold medal.
https://x.com/j_dekoninck/status/1947587647616004583
This is why we do not take things at face value.
Gemini 2.5 has since been superceded by 3.0, which is less likely to need hints. 2.5 was not as strong as the contemporary GPT model, but 3.0 with Pro Thinking mode enabled is up there with the best.
Finally, saying, "Well, they were given some hints" is like me saying, "LOL, big deal, I could drag a Tour peleton up Col du Galibier if I were on the same drugs Lance was using."
No, in fact I could do no such thing, drugs or no drugs. Similarly, a model that can't legitimately reason will not be able to solve these types of problems, even if given hints.