LLMs get better over time. In doing so they occasionally hit points where things that didn't work start working. "Agentic" coding tools that run commands in a loop hit that point within the past six months.
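To make "run commands in a loop" concrete, here's a minimal Python sketch of the pattern these tools implement - call_llm is a hypothetical stand-in for whatever model API a given tool actually uses, not any specific product's interface:

    import subprocess

    def call_llm(transcript):
        # Hypothetical stand-in: ask a model for the next shell command,
        # or the string "DONE" when it thinks the task is finished.
        raise NotImplementedError("wire up a real model API here")

    def agent_loop(task, max_steps=20):
        transcript = ["Task: " + task]
        for _ in range(max_steps):
            command = call_llm("\n".join(transcript))
            if command.strip() == "DONE":
                break
            result = subprocess.run(command, shell=True,
                                    capture_output=True, text=True)
            # Feed the output back so the model can react to errors.
            transcript.append("$ " + command + "\n" + result.stdout + result.stderr)
        return transcript

The point is the feedback loop: the model sees the output of each command it runs, so it can correct its own mistakes instead of guessing blind.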
If your mental model is "people say they got better every six months, therefore I'll never take them seriously because they'll say it again in six months' time", you're hurting your own ability to evaluate this (and every other) technology.
It didn't work in the past, but it works today. Rinse and repeat.
I've been using Cline and it can do a few of the things suggested as "agentic", but I'd have no idea how to leave it writing and then running tests in a VM and creating a PR for me to review. Or let it roam around in the file tree and create new files as needed. How does that work? Are there better tools for this? Or do I need to configure Cline in some way?
I actually do most of my "agentic coding" (not a fan of the term, but whatever) in ChatGPT Code Interpreter, which hasn't changed much in two years other than massive upgrades to the model it uses - I run that mainly via o4-mini-high or o3 these days.
OpenAI's Codex is a leading new entry in this category, but only if you pay $200/month for it. Google's equivalent https://jules.google/ is currently free.
GitHub Copilot gained an "agent mode" recently: https://github.blog/ai-and-ml/github-copilot/agent-mode-101-...
There's also Copilot Coding Agent, which is confusingly an entirely different product: https://github.blog/changelog/2025-05-19-github-copilot-codi...
Yes, but other smart people were making this argument six months ago. Why should we trust the smart person we don't know now if we (looking back) shouldn't have trusted the smart person before?
Part of evaluating a claim is evaluating the source of the claim. For basically everybody, the source of these claims is always "the AI crowd", because those outside the AI space have no way of telling who is trustworthy and who isn't.
Really think about it and ask yourself: is it possible that AI can make any, ANY, work a little more efficient?
In general, part of being an effective member of human society is getting good at evaluating who you should listen to and who is just hot air. I collect people who I consider to be credible and who have provided me with useful information in the past. If they start spouting junk I quietly drop them from my "pay attention to these people" list.
New models come out all the time. One of the most interesting signals to look out for is when they tip over the quality boundary from "not useful at task X" to "useful at task X". It happened for coding about a year ago. It happened for search-based research assistants just two months ago, in my opinion - I wrote about that here: https://simonwillison.net/2025/Apr/21/ai-assisted-search/
You write like this is some grand debate you are engaging in and trying to win. But to people on what you see as the other side, there is no debate. The debate is over.
You drag your feet at your own peril.
Do you use ChatGPT Code Interpreter because it's better, or is it just something you're more familiar with and you're sticking with it for convenience?
Of course, I don't know how one would structure a suitable test, since running the agents sequentially would likely bias things toward the later ones, which would benefit from clearer task descriptions and feedback. I imagine familiarity with how to prompt each particular model is also a factor.
I’ve definitely seen humans do stuff in an hour that takes others days to do. In fact, I see it all the time. And I know people who have the skills to do stuff very quickly but choose not to, because they’d rather procrastinate than get pressured into picking up even more work.
And some people waste even more time writing stuff from scratch when libraries exist for whatever they’re trying to do, which could get them up and running quickly.
So really I don’t think these bold claims of LLMs being so much faster than humans hit as hard as some people think they do.
And here’s the thing: unless you’re using the time you save to take on even more work, you’re not really making productivity gains, you’re just using an LLM to acquire more free time on the company dime.
You might as well do that since any productivity gains will go to your employer, not you.
1. LLM fanboy: "LLMs are awesome, they can do x, y, and z really well."
2. LLM skeptic: "OK, but I tried them and found them wanting for doing x, y, and z"
3. LLM fanboy: "You're doing it wrong. Do it this way ..."
4. The LLM skeptic goes to try it that way, still finds it unsatisfactory. A few months pass....
5. LLM fanboy: "Hey, have you tried model a.b.c-new? The problems with doing x, y, and z have now been fixed" (implicitly now agrees that the original complaints were valid)
6. LLM skeptic: "What the heck, I thought you denied there were problems with LLMs doing x, y, and z? And I still have problems getting them to do it well"
7. Goto 3
That's an argument for LLMs.
>you’re just using an LLM to acquire more free time on the company dime.
This is a bad thing?
In reality, there is a limit to how quickly tasks can be done. Around here, most PRs contain changes that people could just type out in under 30 minutes if they knew exactly what to type. However, getting to the point where you know exactly what you need to type takes days or even weeks, often collaborating across many teams, thinking deeply about potential long-term impacts down the road, and balancing company ROI and roadmap objectives, perhaps even running experiments.
You cannot just throw LLMs at those problems and have them wrapped up in an hour. If that’s what you’re doing, you’re not working on big problems, you’re doing basic refactors and small features that don’t require high-level skills, where the bottleneck is mostly how fast you can type.
You can't own your own SV home but you can become a slumlord somewhere else remotely.