I've been hearing this for 2 years now
the previous model retroactively becomes total dogshit the moment a new one is released
convenient, isn't it?
Yes, it might make a difference, but it is a little tiresome that there's always a “this is based on a model that is x months old!” comment, because it will always be true: an academic study does not get funded, executed, written up, and published in less time.
Even then though, “technology gets better over time” shouldn’t be surprising, as it’s pretty common.
More generally, this phenomenon is simply explained and not surprising: new things improve, quickly. That doesn't mean a given thing is good or valuable, but it's how new tech gets introduced every single time, and it readily explains changing sentiment.
Sure, you may end up missing out on a good thing and having to come late to the party, but come early to the party too many times, finding the beer watered down and grubs in the food, and you're apt to be cynical the next time a party announcement comes your way.
If you pay attention to who says it, you'll find that people have different personal thresholds for finding LLMs useful, not that any given person, like steveklabnik above, keeps flip-flopping on their view.
This is a variant on the goomba fallacy: https://englishinprogress.net/gen-z-slang/goomba-fallacy-exp...
For context, I've been using AI, a mix of OpenAI + Claude, mainly for bashing out quick React stuff, for over a year now. For anything else it's generally rubbish and slower than working without it. I do still use it to rubber-duck, though, so I'm still seeing the level of quality on the backend.
I'd say they're only marginally better today than they were even 2 years ago.
Every time a new model comes out you get a bunch of people raving about how great the new one is, and I honestly can't tell the difference. The only real change is that reasoning models slowed everything down, but now I can see the reasoning, which is useful mainly because I often spot it leaving important stuff out of the final answer.
"No, the 2.8 release is the first good one. It massively improves workflows"
Then, 6 months later, the study comes out.
"Ah man, 2.8 was useless, 3.0 really crossed the threshold on value add"
At some point, you roll your eyes and assume it is just snake oil sales
Like the boy who cried wolf, it'll eventually be true with enough time... But we should stop giving them the benefit of the doubt.
_____
Jan 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Feb 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Mar 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Apr 2025: [Ad nauseam, you get the idea]
Just two years ago, this failed.
> Me: What language is this: "esto está escrito en inglés"
> LLM: English
Gemini and Opus have solved questions that took me weeks to solve myself. And I'll feed some complex code into each new iteration and it will catch a race condition I missed even with testing and line by line scrutiny.
Consider how many years of experience a software engineer needs before they can catch hard race conditions just from reading code, versus a model that couldn't do it after a hundred tries. We already take it for granted, since we frame it as "it caught it or it didn't", but these are massive jumps in capability.
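To make "catching a race condition from reading code" concrete, here is a hedged, hypothetical illustration (not any code from the thread): the classic check-then-act bug, plus the lock that fixes it.

```python
import threading

calls = 0
cache = {}
lock = threading.Lock()

def compute(key):
    # Stands in for an expensive computation we only want to run once per key.
    global calls
    calls += 1
    return key * 2

def get_unsafe(key):
    # Racy: two threads can both see the key missing and both call compute().
    if key not in cache:
        cache[key] = compute(key)
    return cache[key]

def get_safe(key):
    # Fixed: the lock makes the check and the write a single atomic step.
    with lock:
        if key not in cache:
            cache[key] = compute(key)
    return cache[key]

threads = [threading.Thread(target=get_safe, args=(7,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock, compute() ran exactly once despite eight concurrent callers.
```

The unsafe version usually passes in testing, which is exactly why this class of bug survives line-by-line review.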
Of course it's possible that at some point you get to a model that really works, irrespective of the history of false claims from the zealots, but it does mean you should take their comments with a grain of salt.
As with anything, your mileage may vary: I'm not here to tell anyone who thinks they still suck that their experience is invalid, but for me it's been a pretty big swing.
Sure they may get even more useful in the future but that doesn’t change my present.
Every hype cycle feels like this, and some of them are nonsense and some of them are real. We’ll see.
(Unless one believes the most grandiose prophecies of a technological-singularity apocalypse, that is.)
* the release of agentic workflow tools
* the release of MCPs
* the release of new models, Claude 4 and Gemini 2.5 in particular
* subagents
* asynchronous agents
All or any of these could have made for a big or small impact. For example, I’m big on agentic tools, skeptical of MCPs, and don’t think we yet understand subagents. That’s different from those who, for example, think MCPs are the future.
> At some point, you roll your eyes and assume it is just snake oil sales
No, you have to realize you’re talking to a population of people, and not necessarily the same person. Opinions are going to vary, they’re not literally the same person each time.
There are surely snake oil salesman, but you can’t buy anything from me.
Right.
> except that that is the same thing the same people say for every model release,
I did not say that, no.
I am sure you can find someone who is in a Groundhog Day about this, but it’s just simpler than that: as tools improve, more people find them useful than before. You’re not talking to the same people, you are talking to new people each time who now have had their threshold crossed.
Same. For me the turning point was VS Code’s Copilot Agent mode in April. That changed everything about how I work, though it had a lot of drawbacks due to its glitches (many of these were fixed within 6 or so weeks).
When Claude Sonnet 4 came out in May, I could immediately tell it was a step-function increase in capability. It was the first time an AI, faced with ambiguous and complicated situations, would be willing to answer a question with a definitive and confident “No”.
After a few weeks, it became clear that VS Code’s interface and usage limits were becoming the bottleneck. I went to my boss, bullet points in hand, and easily got approval for the Claude Max $200 plan. Boom, another step-function increase.
We’re living in an incredibly exciting time to be a skilled developer. I understand the need to stay skeptical and measure the real benefits, but I feel like a lot of people are getting caught up in the culture war aspect and are missing out on something truly wonderful.
no, it's the same names, again and again
In contrast, what do I care if you believe in code generation AI? If you do, you are probably driving up pricing. I mean, I am sure that there are people that care very much, but there is little inherent value for me in you doing so, as long as the people who are building the AI are making enough profit to keep it running.
With regards to the VCs, well, how many VCs are there in the world? How many of the people who have something good to say about AI are likely VCs? I might be off by an order of magnitude, but even then it would really not be driving the discussion.
Generally, I do a couple of edits for clarity after posting and reading again. Sometimes that involves removing something that I feel could have been said better. If it does not work, I will just delete the comment. Whatever it was must not have been a super huge deal (to me).
We're in a hype cycle, and it means we should be extra critical when evaluating the tech so we don't get taken in by exaggerated claims.
An LLM that can test the code it is writing and then iterate to fix the bugs turns out to be a huge step forward from LLMs that just write code without trying to then exercise it.
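That write-run-fix loop is mechanically simple; what changed is the models driving it. A minimal sketch in Python, where `propose_fix` is a hypothetical stand-in for a real model call (not any vendor's API):

```python
import os
import subprocess
import sys
import tempfile

def run_tests(code):
    """Write candidate code (with inline asserts) to a file, run it,
    and return (ok, error_output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        r = subprocess.run([sys.executable, path], capture_output=True, text=True)
        return r.returncode == 0, r.stderr
    finally:
        os.unlink(path)

def agent_loop(propose_fix, prompt, max_iters=5):
    """Ask the 'model' for code, execute it, and feed failures back."""
    code, feedback = "", ""
    for _ in range(max_iters):
        code = propose_fix(prompt, code, feedback)
        ok, feedback = run_tests(code)
        if ok:
            return code
    return None  # give up after max_iters attempts

# Stub model: the first attempt forgets to define add(), the second fixes it.
attempts = iter([
    "assert add(2, 2) == 4",
    "def add(a, b):\n    return a + b\nassert add(2, 2) == 4",
])
result = agent_loop(lambda prompt, code, feedback: next(attempts), "write add()")
```

The loop itself is trivial; the step forward is that current models make productive use of the `feedback` string instead of looping forever.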
That sounds like a claim you could back up with a little bit of time spent using Hacker News search or similar.
(I might try to get a tool like o3 to run those searches for me.)
So are you using Claude Code via the max plan, Cursor, or what?
I think I'd definitely hit AI news exhaustion and was viewing people raving about this agentic stuff as yet more AI fanbois. I'd just continued using the AI separately, as setting up a new IDE seemed like too much work for the fractional gains I'd been seeing.
There is a skill gap, like, I think of it like vim: at first it slows you down, but then as you learn it, you end up speeding up. So you may also find that it doesn't really vibe with the way you work, even if I am having a good time with it. I know people who are great engineers who still don't like this stuff, just like I know ones that do too.
I do not program for my day job and I vibe coded two different web projects. One in twenty minutes, as a test, with Cloudflare deployment, having never used Cloudflare, and one in a week over vacation (and then fixed a deep Safari bug two weeks later by hammering the LLM). These tools massively raise the capabilities of sub-average programmers like me and decrease the time / brain requirements significantly.
I had to make a little update to reset the KV store on cloudflare and the LLM did it in 20s after failing the syntax twice. I would’ve spent at least a few minutes looking it up otherwise.
The people not buying into the hype, on the other hand, are actually the ones with a very good reason to be invested, because if they turn out to be wrong they may face some very uncomfortable adjustments in the job landscape, and see a lot of the skills they worked so hard to gain, and believed to be valuable, lose their worth.
As always, be wary of any claims, but the tension here is very much the reverse of crypto, and I don't think that's widely appreciated.
The jump has been massive.
It's been a very noticeable uptick in power, and although there have been some nice increases with past model releases, this has been both the largest and the one that has unlocked the most real value since I've been following the tech.
[0]: https://marketplace.visualstudio.com/items?itemName=anthropi...
That's not a tradeoff that I like
i'm using claude + vscode's cline extension for the most part, but where it tends to excel is helping you write documentation, and then using that documentation to write reasonable code.
if you're 3/4 of the way done, a lot of the docs it wants in order to work well are gonna be missing, and so a lot of your intentions about why you did or didn't make certain choices will be missing too. if you've got good docs, make sure to feed those in as context.
the agentic tool on its own is still kinda meh if you only try to write code directly from it. definitely better than the non-agentic stuff, but if you start by getting it to document stuff, and ask you questions about what it should know in order to make the change, it's pretty good.
even if you don't get perfect code, or it spins in a feedback loop where it's lost the plot, those questions it asks can be super handy in terms of code patterns you haven't thought about that apply to your code, and things that would usually be undefined behaviour.
my raving is that i get to leave behind useful docs in my code packages, and my team members get access to and use those docs, without the usual discoverability problems. and i get those docs for... somewhat slower than i could have written the code myself, but much, much faster than if i also had to write those docs.
The steam-powered loom was not good for the Luddites either. Good for society at large in the long term, but all the negative points a 40-year-old knitter in 1810 could make against the steam-powered loom would have been perfectly reasonable and accurate judged from that individual's perspective.
It's not showing its reasoning. "Reasoning" models are trained to output more tokens in the hope that more tokens mean fewer hallucinations.
It's just a marketing trick, and there's no evidence this sort of fake "reasoning" actually gives any benefit.
I pointed this out in my post for a reason. I get it. But even given a different person is saying the same thing every time a new release comes out - the effect on my prior is the same.
Keep writing your code manually, nobody cares.
I asked it to implement two biquad filters, a high-pass filter and a high-shelf filter. Some context: using the Gemini web app, it would spit out the exact code I need, with the interfaces I require, one-shot, because this is truly trivial C++ code to write.
15 million tokens and an hour and a half later, I had a project that could not build, the filters were not implemented, and my trust in AI agentic workflows was broken.
It cost me nothing, I just reset the repo and I was watching youtube videos for that hour and a half.
Your mileage may vary, and I'm very sure that if this were Go or TypeScript it might have done significantly better, but even compared to the exact same model in a chat interface, my experience was horrible.
I'm sticking to the slightly "worse" experience of using the chat interface, which does give me significant improvements in productivity, versus letting the agent burn money and time and not produce working code.
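For scale, here's roughly what that kind of filter amounts to: the standard RBJ audio-EQ-cookbook second-order (biquad) high-pass. Sketched in Python rather than C++, but the coefficient math carries over directly; the high-shelf case is analogous.

```python
import math

def highpass_biquad(fs, f0, q):
    """RBJ audio-EQ-cookbook coefficients for a biquad high-pass.
    Returns (b0, b1, b2, a1, a2), normalized so that a0 == 1."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    cosw0 = math.cos(w0)
    b0 = (1.0 + cosw0) / 2.0
    b1 = -(1.0 + cosw0)
    b2 = (1.0 + cosw0) / 2.0
    a0 = 1.0 + alpha
    a1 = -2.0 * cosw0
    a2 = 1.0 - alpha
    return (b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0)

def process(coeffs, samples):
    """Direct-form-I filtering of a sequence of samples."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x1, x2 = x, x1
        y1, y2 = y, y1
        out.append(y)
    return out
```

A quick sanity check on such a filter: the numerator coefficients sum to zero, so a constant (DC) input decays to zero at the output, exactly as a high-pass should.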