zlacker

>If you were trying and failing to use an LLM for code 6 months ago †, you’re not doing what most serious LLM-assisted coders are doing.

Here’s the thing from the skeptic perspective: This statement keeps getting made on a rolling basis. 6 months ago if I wasn’t using the life-changing, newest LLM at the time, I was also doing it wrong and being a luddite.

It creates a never ending treadmill of boy-who-cried-LLM. Why should I believe anything outlined in the article is transformative now when all the same vague claims about productivity increases were being made about the LLMs from 6 months ago which we now all agree are bad?

I don’t really know what would actually unseat this epistemic prior at this point for me.

In six months, I predict the author will again think the LLM products of 6 month ago (now) were actually not very useful and didn’t live up to the hype.

replies(11): >>simonw+42 >>anxoo+gc >>dolebi+5g >>idlewo+wi >>stouse+cj >>someot+Jj >>librar+wk >>orions+Ak >>carpo+Yk >>mathgo+cl >>killer+In

>>davidc+(OP)
tptacek wasn't making this argument six months ago.

LLMs get better over time. In doing so they occasionally hit points where things that didn't work start working. "Agentic" coding tools that run commands in a loop hit that point within the past six months.

If your mental model is "people say they got better every six months, therefore I'll never take them seriously because they'll say it again in six months time" you're hurting your own ability to evaluate this (and every other) technology.

replies(4): >>JohnKe+rc >>espere+1g >>whoist+2j >>cmdli+5j

>>davidc+(OP)
name 5 tasks which you think current AIs can't do. then go and spend 30 minutes seeing how current AIs can do on them. write it on a sticky note and put it somewhere that you'll see it.

otherwise, yes, you'll continue to be irritated by AI hype, maybe up until the point where our civilization starts going off the rails

replies(7): >>apwell+Hf >>TheRoq+Cg >>AtlasB+8h >>poinca+Gi >>alison+tl >>ipaddr+Om >>chinch+hp2

>>simonw+42
But they say "yes, it didn't work 6 months ago, but it does now", and they say this every month. They're constantly resetting the goal post.

Today it works, it didn't in the past, but it does now. Rinse and repeat.

replies(6): >>Sebgue+9f >>jasonf+Sf >>concep+bi >>nsonha+8j >>jasonf+aj >>skwirl+jl

>>JohnKe+rc
this is only a compelling counter-argument if you are referring to a single, individual person who is saying this repeatedly. and there probably are! but the author of this article is not that person, and is also speaking to a very specific loop that only first truly became prevalent 6-9 months ago.

>>anxoo+gc
> name 5 tasks which you think current AIs can't do.

For coding it seems to back itself into a corner and never recover from it until i "reset" it .

AI can't write software without an expert guiding it. I cannot open a non trivial PR to postgres tonight using AI.

replies(1): >>simonw+Ej

>>JohnKe+rc
"they say this every month" But I think the commenter is saying "they" comprises many different people, and they can each honestly say, at different times, "LLMs just started working". I had been loving LLMs for solving NLP since they came out, and playing with them all the time, but in my field I've only found them to improve productivity earlier this year (gemini 2.5).

>>simonw+42
I stopped paying attention for a few days so I'm way out of date. What is the state of the art for agentic coding now?

I've been using Cline and it can do a few of the things suggested as "agentic", but I'd have no idea how to leave it writing and then running tests in a VM and creating a PR for me to review. Or let it roam around in the file tree and create new files as needed. How does that work? Are there better tools for this? Or do I need to configure Cline in some way?

replies(2): >>simonw+Ci >>nsonha+Cj

>>davidc+(OP)
> Here’s the thing from the skeptic perspective: This statement keeps getting made on a rolling basis.

Dude, just try the things out. It's just undeniable in my day-to-day life that I've been able to rely on Sonnet (first 3.7 and now 4.0) and Gemini 2.5 to absolutely crush code. I've done 3 side projects in the past 6 months that I would have been way too lazy to build without these tools. They work. Never going back.

replies(1): >>ryandr+bl

>>anxoo+gc
Well, I'll try to do a sticky note here:

- they can't be aware of the latest changes in the frameworks I use, and so force me to use older features, sometimes less efficient

- they fail at doing clean DRY practices even though they are supposed to skim through the codebase much faster than me

- they bait me into inexisting apis, or hallucinate solutions or issues

- they cannot properly pick the context and the files to read in a mid-size app

- they suggest to download some random packages, sometimes low quality ones, or unmaintained ones

replies(7): >>travis+Dh >>bdangu+5i >>simonw+bj >>motza+kj >>bradfa+fl >>apwell+Yn >>agotte+Zp

>>anxoo+gc
Everyone keeps thinking AI improvement is linear. I don't know if this is correct, but it's just my basic impression that the current AI boost came from instead of being limiting yourself to the CPU and its throughput adding the massive amount of computing power in graphics cards.

But for each nine of reliability you want out of llms everyone's assuming it's just a linear growth. I don't think it is. I think it's polynomial at least.

As for your tasks and maybe it's just cuz I'm using chat GPT, but I asked it to Port sed, something with full open source code availability, tons of examples/test cases, a fully documented user interface and I wanted it moved to Java as a library.

And it failed pretty spectacularly. Yeah it got the very very very basic functionality of sed.

replies(1): >>kaydub+tt

>>TheRoq+Cg
Those aren't tasks.

>>TheRoq+Cg
they can't be aware of the latest changes in the frameworks I use, and so force me to use older features, sometimes less efficient

of course they can, teach them / feed them latest changes or whatever you need (much like another developer unaware of the same thing)

they fail at doing clean DRY practices even though they are supposed to skim through the codebase much faster than me

tell them it is not DRY until they make it DRY. for some (several projects I’ve been involved with) DRY is generally anti-pattern when taken to extremes (abstraction gone awry etc…). instruct it what you expect and it and watch it deliver (much like you would another developer…)

they bait me into inexisting apis, or hallucinate solutions or issues

tell it when it hallucinates, it’ll correct itself

they cannot properly pick the context and the files to read in a mid-size app

provide it with context (you should always do this anyways)

they suggest to download some random packages, sometimes low quality ones, or unmaintained ones

tell it about it, it will correct itself

replies(2): >>esjeon+Yj >>apwell+2o

>>JohnKe+rc
I don’t think this is true actually. There was a huge shift of llm coding ability with the release of sonnet 2.5. That was a real shift in how people started using LMS for coding. Before that it was more of a novelty not something people used a lot for real work. As someone who is not a software engineer, as of about November 2024, I “write” hundreds of lines of code a day for meaningful work to get done.

replies(1): >>ipaddr+Jl

>>davidc+(OP)
An exponential curve looks locally the same at all points in time. For a very long period of time, computers were always vastly better than they were a year ago, and that wasn't because the computer you'd bought the year before was junk.

Consider that what you're reacting to is a symptom of genuine, rapid progress.

replies(3): >>Retr0i+pj >>pera+lm >>godels+Rm

>>espere+1g
tptacek is using Zed, which I've not tried myself.

I actually do most of my "agentic coding" (not a fan of the term, but whatever) in ChatGPT Code Interpreter, which hasn't changed much in two years other than massive upgrades to the model it uses - I run that mainly via o4-mini-high or o3 these days.

OpenAI's Codex is a leading new thing, but only if you pay $200/month for it. Google's equivalent https://jules.google/ is currently free.

GitHub Copilot gained an "agent mode" recently: https://github.blog/ai-and-ml/github-copilot/agent-mode-101-...

There's also Copilot Coding Agent, which is confusingly an entirely different product: https://github.blog/changelog/2025-05-19-github-copilot-codi...

replies(1): >>garble+Yl

>>anxoo+gc
1. create a working (moderately complex) ghidra script without hallucinating.

Granted I was trying to do this 6 months ago, but maybe a miracle has happened. But I'm the past I had very bad experience with using LLMs for niche things (i.e. things that were never mentioned on stackoverflow)

replies(1): >>simonw+tj

>>simonw+42
Have the models significantly improved, or have we just developed new programs that take better advantage of them?

replies(2): >>jppitt+gj >>dcre+Rl

>>simonw+42
> tptacek wasn't making this argument six months ago.

Yes, but other smart people were making this argument six months ago. Why should we trust the smart person we don't know now if we (looking back) shouldn't have trusted the smart person before?

Part of evaluating a claim is evaluating the source of the claim. For basically everybody, the source of these claim is always "the AI crowd", because those outside the AI space have no way of telling who is trustworthy and who isn't.

replies(3): >>simonw+Rj >>kalkin+wm >>secale+1n

>>JohnKe+rc
Why focus on the 6 months or however long you think the cycle is. The milestones of AI coding are self-explanatory: autocomplete (shit) -> multi-files edit (useful for simple cases) -> agent (feedback loop with rag & tool use), this is where we are.

Really think about it and ask yourself if it's possible that AI can make any, ANY work a little more efficient?

>>JohnKe+rc
I don't really get this argument. Technology can be improving can't it? You're just saying that people saying it's improving, isn't a great signal. Maybe not, but you still don't conclude that the tech isn't improving, right? If you're old enough, remember the internet was very much hyped. Al gore was involved. But it's probably been every bit as transformative as promised.

replies(1): >>prmph+An

>>TheRoq+Cg
"they can't be aware of the latest changes in the frameworks I use, and so force me to use older features, sometimes less efficient"

That's mostly solved by the most recent ones that can run searches. I've had great results from o4-mini for this, since it can search for the latest updates - example here: https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

Or for a lot of libraries you can dump the ENTIRE latest version into the prompt - I do this a lot with the Google Gemini 2.5 models since those can handle up to 1m tokens of input.

"they fail at doing clean DRY practices" - tell them to DRY in your prompt.

"they bait me into inexisting apis, or hallucinate solutions or issues" - really not an issue if you're actually testing your code! I wrote about that one here: https://simonwillison.net/2025/Mar/2/hallucinations-in-code/ - and if you're using one of the systems that runs your code for you (as promoted in tptacek's post) it will spot and fix these without you even needing to intervene.

"they cannot properly pick the context and the files to read in a mid-size app" - try Claude Code. It has a whole mechanism dedicated to doing just that, I reverse-engineered it this morning: https://simonwillison.net/2025/Jun/2/claude-trace/

"they suggest to download some random packages, sometimes low quality ones, or unmaintained ones" - yes, they absolutely do that. You need to maintain editorial control over what dependencies you add.

replies(2): >>TheRoq+4m >>timr+Jn

>>davidc+(OP)
I saw this article and thought, now's the time to try again!

Using Claude Sonnet 4, I attempted to add some better configuration to my golang project. An hour later, I was unable to get it to produce a usable configuration, apparently due to a recent v1-to-v2 config format migration. It took less time to hand-edit one based on reading the docs.

I keep getting told that this time agents are ready. Every time I decide to use them they fall flat on their face. Guess I'll try again in six months.

replies(4): >>Yiin+ok >>simonw+Tk >>porrid+tm >>Darmok+Bn

>>whoist+2j
Both.

replies(1): >>simonw+ak

>>TheRoq+Cg
I have definitely noticed these as well. Have you ever tried prompting these issues away? I'm thinking this might be a good list to add to every coding prompt

>>idlewo+wi
I don't think anyone's contesting that LLMs are better now than they were previously.

replies(1): >>Neverm+2k

>>poinca+Gi
I've never heard of Ghidra before but, in case you're interested, I ran that prompt through OpenAI's o3 and Anthropic's Claude Opus 4 for you just now (both of them the latest/greatest models from those vendors and new as of less than six months ago) - results here: https://chatgpt.com/share/683e3e38-cfd0-8006-9e49-2aa799dac4... and https://claude.ai/share/7a076ca1-0dee-4b32-9c82-8a5fd3beb967

I have no way of evaluating these myself so they might just be garbage slop.

replies(1): >>jcranm+Mp

>>espere+1g
The current state of agent in the last batch of project launches (copilot agent, jules, devin...) is to take over and do things in a PR like you want. However the vedict is still out there in terms of whether these implementation prove more useful than agentic code in an IDE.

>>apwell+Hf
"AI can't write software without an expert guiding it. I cannot open a non trivial PR to postgres tonight using AI."

100% true, but is that really what it would take for this to be useful today?

>>davidc+(OP)
Also, professional programmers have varying needs. These people are coding in different languages, with varying complexity, domains, existing code bases and so on.

People making arguments based on sweeping generalizations to a wide audience are often going to be perceived as delusional, as their statements do not apply universally to everyone.

To me, thinking LLMs can code generally because you have success with them and then telling others they are wrong in how they use them is making a gigantic assumptive leap.

replies(1): >>spacem+fk

>>cmdli+5j
Part of being on Hacker News is learning that there are people in this community - like tptacek - who are worth listening to.

In general, part of being an effective member of human society is getting good at evaluating who you should listen to and who is just hot air. I collect people who I consider to be credible and who have provided me with useful information in the past. If they start spouting junk I quietly drop them from my "pay attention to these people" list.

>>bdangu+5i
Anecdotally, ChatGPT still struggles with its own API. It keeps juggling between different versions of its API and hallucinates API parameters, even when I force-feed official documents into the context (to be fair, the documentation is straight awful). Sometimes it totally refuses to change its basic assumptions, so I have to blow up the context just to make it use the up-to-date API correctly.

LLMs are stupid - nothing magic, nothing great. They’re just tools. The problem with the recent LLM craze is that people make too many obviously partially true statements.

replies(1): >>simonw+Gk

>>Retr0i+pj
> It creates a never ending treadmill of boy-who-cried-LLM.

The crying wolf reference only makes sense as a soft claim that LLM’s better or not, are not getting better in important ways.

Not a view I hold.

replies(1): >>Retr0i+Im

>>jppitt+gj
Yeah, definitely both.

New models come out all the time. One of the most interesting signals to look out for is when they tip over the quality boundary from "not useful at task X" to "useful at task X". It happened for coding about a year ago. It happened for search-based research assistants just two months ago, in my opinion - I wrote about that here: https://simonwillison.net/2025/Apr/21/ai-assisted-search/

>>someot+Jj
I just assume every blog post in HN starts with “As a web dev, TITLE”

>>stouse+cj
you can add links to docs to llm agents instead of letting them work blindfolded with hardcoded assumptions

replies(2): >>maleld+xn >>stouse+JN

>>davidc+(OP)
This isn't a particularly useful filter, because it applies to many very successful technologies as well. Early automobiles generated a lot of hype and excitement, but they were not very good (unreliable, loud, and dangerous, and generally still worse than horses). They got steadily better until eventually they hit an inflection point where the skeptics were dug in repeating the same increasingly old complaints, while Henry Ford was building the Model T.

>>davidc+(OP)
At what point would you be impressed by a human being if you asked it to help you with a task every 6 months from birth until it was 30 years old?

If you ask different people the above question, and if you vary it based on type of task, or which human, you would get different answers. But as time goes on, more and more people would become impressed with what the human can do.

I don't know when LLMs will stop progressing, but all I know is they continue to progress at what is to me a similar astounding rate as to a growing child. For me personally, I never used LLMs for anything, and since o3 and Gemini 2.5 Pro, I use them all the time for all sorts of stuff.

You may be smarter than me and still not impressed, but I'd try the latest models and play around, and if you aren't impressed yet, I'd bet money you will be within 3 years max (likely much earlier).

replies(1): >>Velori+Gl

>>esjeon+Yj
That's because GPT-4o's training cut-off is Sep 30, 2023 (see https://platform.openai.com/docs/models/gpt-4o) and the OpenAI API has changed a LOT since then.

Claude 4 has a training cut-off of March 2025, I tried something today about its own API and it gave me useful code.

>>stouse+cj
If you share your conversation (with the share link in Claude) I'd be happy to see if there are any tweaks I can suggest to how you prompted it.

>>davidc+(OP)
I think they just meant it hit an inflection point. Some people were copying pasting to ChatGPT and saying it was crap and others were using agents that could see the context of the code and worked much, much better. It's the workflow used not just the specific LLM.

>>dolebi+5g
Why can't reviews of AI be somewhere in the middle between "useless" and "the second coming"?

I tried Copilot a few months ago just to give it a shot and so I could discuss it with at least a shred of experience with the tool, and yea, it's a neat feature. I wouldn't call it a gimmick--it deserves a little more than that, but I didn't exactly cream my pants over it like a lot of people seem to be doing. It's kind of convenient, like a smart autocomplete. Will it fundamentally change how I write software? No way. But it's cool.

replies(1): >>dolebi+rn

>>davidc+(OP)
In my experience it's less about the latest generation of LLMs being better, and more about the tooling around them for integration into a programmer's workflow being waaaay better.

The article doesn't explicitly spell it out until several paragraphs later, but I think what your quoted sentence is alluding to is that Cursor, Cline et al can be pretty revolutionary in terms of removing toil from the development process.

Need to perform a gnarly refactor that's easy to describe but difficult to implement because it's spread far and wide across the codebase? Let the LLM handle it and then check its work. Stuck in dependency hell because you updated one package due to a CVE? The LLM can (often) sort that out for you. Heck, did the IDE's refactor tool fail at renaming a function again? LLM.

I'm remain skeptical of LLM-based development insofar as I think the enshitification will inevitably come when the Magic Money Machine breaks down. And I don't think I would hire a programmer that needs LLM assistance in order to program. But it's hard to deny that it has made me a lot more productive. At the current price it's a no-brainer to use it.

replies(1): >>tho23j+Oq

>>TheRoq+Cg
They also can’t hold copyright on their creations.

>>JohnKe+rc
It doesn’t really matter what this or that person said six months ago or what they are saying today. This morning I used cursor to write something in under an hour that previously would have taken me a couple of days. That is what matters to me. I gain nothing from posting about my experience here. I’ve got nothing to sell and nothing to prove.

You write like this is some grand debate you are engaging in and trying to win. But to people on what you see as the other side, there is no debate. The debate is over.

You drag your feet at your own peril.

replies(1): >>deadba+zm

>>anxoo+gc
The problem with AI hype is not really about whether a particular model can - in the abstract - solve a particular programming problem. The problem with AI hype is that it is selling a future where all software development companies become entirely dependent on closed systems.

All of the state-of-the-art models are online models - you have no choice, you have to pay for a black box subscription service controlled by one of a handful of third-party gatekeepers. What used to be a cost center that was inside your company is now a cost center outside your company, and thus it is a risk to become dependent on it. Perhaps the risk is worthwhile, perhaps not, but the hype is saying that real soon now it will be impossible to not become dependent on these closed systems and still exist as a viable company.

>>orions+Ak
> At what point would you be impressed by a human being if you asked it to help you with a task every 6 months from birth until it was 30 years old?

In this context, never. Especially because the parent knows you will always ask 2+2 and can just teach the child to say “four” as their first and only word. You’ll be on to them, too.

replies(2): >>Velori+Bo >>x-comp+dq

>>concep+bi
How did you manage before?

replies(1): >>concep+BY2

>>whoist+2j
Both, but it’s mostly the models. The programs like Claude Code are actually simpler than the ones from before because of this.

>>simonw+Ci
I'd be quite interested in a more formal post with a detailed analysis of the effectiveness of the different agent impls, including Claude Code and Jetbrains Junie.

Do you use ChatGPT Code Interpreter because it's better, or is it just something you're more familiar with and you're sticking with it for convenience?

Of course, I don't know how one would structure a suitable test, since doing it sequentially would likely bias the later agents with clearer descriptions & feedback on the tasks. I imagine familiarity with how to prompt each particular model is also a factor.

replies(1): >>simonw+NB

>>simonw+bj
Thanks for the links. You mentioned 2 models in your posts, how should I proceed ? I can't possibly pay 2 subscriptions.. do you have a question for the better one to use ?

replies(1): >>simonw+cr

>>idlewo+wi
A flatline also looks locally the same at all points in time.

replies(1): >>EA-316+jn

>>stouse+cj
Yes.

I made the mistake of procrastinating on one part of a project thinking "Oh, that is easily LLMable". By God, was I proven wrong. Was quite the rush before the deadline.

On the flip side, I'm happy I don't have to write the code for a matplotlib scatterplot for the 10000th time, it mostly gets the variables in the current scope that I intended to plot. But I've really not had that much success on larger tasks.

The "information retrieval" part of the tech is beautiful though. Hallucinations are avoided only if you provide an information bank in the context in my experience. If it needs to use the search tool itself, it's not as good.

Personally, I haven't seen any improvement from the "RLd on math problems" models onward (I don't care for benchmarks). However, I agree that deepseek-r1-zero was a cool result. Pure RL (plain R1 used a few examples) automatically leading to longer responses.

A lot of the improvements suggested in this thread are related to the infra around LLMs such as tool use. These are much more well organised these days with MCP and what not, enabling you to provide it the aforementioned information bank easily. But all of it is built on top of the same fragile next-token generator we know and love.

>>cmdli+5j
If you automatically lump anyone who makes an argument that AI is capable - not even good for the world on net, just useful in some tasks - into "the AI crowd", you will tautologically never hear that argument from anywhere else. But if you've been paying attention to software development discussion online for a few years, you've plausibly heard of tptacek and kentonv, eg, from prior work. If you haven't heard of them in particular, no judgement, but you gotta have someone you can classify as credible independently of their AI take if you want to be able to learn anything at all from other people on the subject.

>>skwirl+jl
The thing about people making claims like “An LLM did something for me in an hour that would take me days” is that people conveniently leave out what their own skill level is.

I’ve definitely seen humans do stuff in an hour that takes others days to do. In fact, I see it all the time. And sometimes, I know people who have skills to do stuff very quickly but they choose not to because they’d rather procrastinate and not get pressured to pick up even more work.

And some people waste even more time writing stuff from scratch when libraries exist for whatever they’re trying to do, which could get them up and running quickly.

So really I don’t think these bold claims of LLMs being so much faster than humans hit as hard as some people think they do.

And here’s the thing: unless you’re using the time you save to fill yourself up with even more work, you’re not really making productivity gains, you’re just using an LLM to acquire more free time on the company dime.

replies(3): >>maleld+kn >>skwirl+Ur >>Throwa+qv

>>Neverm+2k
The implicit claim is just that they're still not good enough (for whatever the use cases the claimant had in mind)

>>anxoo+gc
Make me a million dollars

Tell me about this specific person who isn't famous

Create a facebook clone

Recreate Windows including drivers

Create a way to transport matter like in Star Trek.

I'll see you in 6 months.

>>idlewo+wi

  > An exponential curve looks locally the same at all points in time

This is true for any curve...

If your curve is continuous, it is locally linear.

There's no use in talking about the curve being locally similar without the context of your window. Without the window you can't differentiate an exponential from a sigmoid from a linear function.

Let's be careful with naive approximations. We don't know which direction things are going and we definitely shouldn't assume "best case scenario"

replies(1): >>sfpott+9n

>>cmdli+5j
Thomas is one of the pickier, crankier, least faddish technologists I've ever met. If he has gone fanboy that holds a lot of weight with me.

>>godels+Rm
A curve isn't necessarily locally linear if it's continuous. Take f(x) = |x|, for example.

replies(2): >>nickff+yn >>godels+ss

>>pera+lm
Nor does local flatness imply direction, the curve could be descending for all that "looks locally flat" matters. It also isn't on the skeptics to disprove that "AI" is a transformative, exponentially growing miracle, it's on the people selling it.

>>deadba+zm
> you’re just using an LLM to acquire more free time on the company dime

You might as well do that since any productivity gains will go to your employer, not you.

replies(1): >>astran+eW

>>ryandr+bl
My anecdotes are all I've got, bro. If they come a little creamy, well, YMMV.

>>Yiin+ok
Claude Code will even request access documentation on its own sometimes. I caught it asking to run a `pydoc` command the other day. I'm not sure if it has access to web search, but it should.

>>sfpott+9n
There may have been a discontinuity at the beginning of time... but there was nobody there to observe it. More seriously, the parent is saying that it always looks continuous linear when you're observing the last short period of time, whereas the OP (and many others) are constantly implying that there are recent discontinuities.

replies(1): >>godels+uA

>>jasonf+aj
Technology improving is not the issue.

1. LLM fanboy: "LLMs are awesome, they can do x, y, and z really well."

2. LLM skeptic: "OK, but I tried them and found them wanting for doing x, y, and z"

3. LLM fanboy: "You're doing it wrong. Do it this way ..."

4. The LLM skeptic goes to try it that way, still finds it unsatisfactory. A few months pass....

5. LLM fanboy: "Hey, have you tried model a.b.c-new? The problems with doing x, y, and z have now been fixed" (implicitly now agrees that the original complaints were valid)

6. LLM skeptic: "What the heck, I though you denied there were problems with LLMs doing x, y, and z? And I still have problems getting them to do it well"

7. Goto 3

>>stouse+cj
> It took less time to hand-edit one based on reading the docs.

You can give it the docs as an "artifact" in a project - this feature has been available for almost one year now.

Or better yet, use the desktop version + a filesystem MCP server pointing to a folder containing your docs. Tell it to look at the docs and refactor as necessary. It is extremely effective at this. It might also work if you just give it a link to the docs.

replies(1): >>stouse+FN

>>davidc+(OP)
Bullshit. We have absolute numbers, not just vibes.

The top of SWE-bench Verified leaderboard was at around 20% in mid-2024, i.e. AI was failing at most tasks.

Now it's at 70%.

Clearly it's objectively better at tackling typical development tasks.

And it's not like it went from 2% to 7%.

replies(1): >>lexand+4q

>>simonw+bj
> Or for a lot of libraries you can dump the ENTIRE latest version into the prompt - I do this a lot with the Google Gemini 2.5 models since those can handle up to 1m tokens of input.

See, as someone who is actually receptive to the argument you are making, sometimes you tip your hand and say things that I know are not true. I work with Gemini 2.5 a lot, and while yeah, it theoretically has a large context window, it falls over pretty fast once you get past 2-3 pages of real-world context.

> "they fail at doing clean DRY practices" - tell them to DRY in your prompt.

Likewise here. Simply telling a model to be concise has some effect, to be sure, but it's not a panacea. I tell the latest models do do all sorts of obvious things, only to have them turn around and ignore me completely.

In short, you're exaggerating. I'm not sure why.

replies(1): >>simonw+0s

>>TheRoq+Cg
> - they bait me into inexisting apis, or hallucinate solutions or issues

yes. this happens to me almost every time i use it. I feel like a crazy person reading all the AI hype.

>>bdangu+5i
> tell it when it hallucinates, it’ll correct itself

no it doesn't. Are you serious?

replies(1): >>bdangu+1p

>>Velori+Gl
To be clear, I’m just saying the analogy isn’t great, not that one can never be impressed by an LLM (or a person for that matter)!

>>apwell+2o
just today 3 times and countless times before… you just gotta take some serious time to learn and understand it… or alternatively write snarky comments on the internet…

replies(2): >>apwell+aL1 >>sorami+kg3

>>simonw+tj
The first one doesn't seem to actually give me the script, so I can't test it.

The second one didn't work for me without some code modification (specifically, the "count code blocks" didn't work), but the results were... not impressive.

It starts by ignoring every function that begins with "FUN_" on the basis that it's "# Skip compiler-generated functions (optional)". Sorry, but those functions aren't compiler-generated functions, they're functions that lack any symbol names, which in ghidra terms, is pretty damn common if you're reverse engineering unsymbolized code. If anything, it's the opposite of what you would want, because the named functions are the ones I've already looked at and thus give less of a guideline for interesting ones to look into next.

Looking at the results at a project I had open, it's supposed to be skipping external functions, but virtually all the top xrefs are external functions.

Finally, as a "moderately complex" script... it's not a good example. The only thing that approaches that complexity is trying to count basic blocks in a function--something that actually engages with the code model of Ghidra--but that part is broken, and I don't know Ghidra well enough to fix it. Something that would be more along the lines of "moderately complex" to me would be (to use a use case I actually have right now) for example turning the constant into a reference to that offset in the assumed data segment. Or finding all the switch statements that ghidra failed to decompile!

>>TheRoq+Cg
> they can't be aware of the latest changes in the frameworks I use, and so force me to use older features, sometimes less efficient

This is where collaboration comes in play. If you solely rely on the LLM to “vibe code” everything, then you’re right, you get whatever it thinks is best at the time of generation. That could be wrong or outdated.

My workflow is to first provide clear requirements, generally one objective at a time. Sometimes I use an llm to format the requirements for the llm to generate code from. It then writes some code, and I review it. If I notice something is outdated I give it a link to the docs and tell it to update it using X. A few seconds later it’s made the change. I did this just yesterday when building out an integration with an api. Claude wrote the code using a batch endpoint because the steaming endpoint was just released and I don’t think it was aware of it. My role in this collaboration, is to be aware of what’s possible and how I want it to work (e.g.. being aware of the latest features and updates of the frameworks and libraries). Then it’s just about prompting and directing the llm until it works the way I want. When it’s really not working, then I jump in.

>>killer+In
Isn't SWE-bench based on public Github issues? Wouldn't the increase in performance also be explained by continuing to train on newer scraped Github data, aka training on the test set?

The pressure for AI companies to release a new SOTA model is real, as the technology rapidly become commoditised. I think people have good reason to be skeptical of these benchmark results.

replies(1): >>killer+Hq

>>Velori+Gl
> In this context, never. Especially because the parent knows you will always ask 2+2 and can just teach the child to say “four” as their first and only word. You’ll be on to them, too.

On the assumption that you'll always only ask it "what's 2+2?" Keywords being "always" & "you".

In aggregate, the set of questions will continuously expand as a non-zero percentage of people will ask new questions. The set of questions asked will continue to expand, and the LLMs will continue to be trained to fill in the last 20%.

Even under the best interpretations, this is the detractors continuously moving goalposts, because the last 20% will never be filled: New tasks will continuously be found, and critics will point to them as "oh, see, they can't do that". By the time that the LLMs can do those tasks, the goalpost will be moved to a new point and they'll continue to be hypocrites.

------

> > At what point would you be impressed by a human being if you asked it to help you with a task every 6 months from birth until it was 30 years old?

Taking GP's question seriously:

When a task consisting of more than 20 non-decomposable (atomic) sub-tasks is completed above 1 standard deviation of the human average in that given task. (much more likely)

OR

When an advancement is made in a field by that person. (statistically much rarer)

>>lexand+4q
That sounds like a conspiracy theory. If it was just some mysterious benchmark and nothing else then sure, you have reasons to be skeptical.

But there's a plenty of people who actually tried LLMs for actual work and swear they work now. Do you think they are all lying?..

Many people with good reputation, not just noobs.

>>mathgo+cl
It's great when it works, but half the time IME it's so stupid that it can't even use the edit/path tools properly even when given line numbers prepended inputs.

(I should know since I've created half-a-dozen tools for this with gptel. Cline hasn't been any better on my codebase.)

replies(1): >>karthi+Sx

>>TheRoq+4m
If you're only going to pay one $20/month subscription I think OpenAI wins at the moment - their search tools are better and their voice chat interface is better too.

I personally prefer the Claude models but they don't offer quite as rich a set of extra features.

If you want to save money, consider getting API accounts with them and spending money that way. My combined API bill across OpenAI, Anthropic and Gemini rarely comes to more than about $10/month.

>>deadba+zm
Again, implicit in this comment is the belief that I am out to or need to convince you of something. You would be the only person who would benefit from that. I don’t gain anything from it. All I get out of this is having insulting comments about my “skill level” posted by someone who knows nothing about me.

replies(1): >>deadba+6d2

>>timr+Jn
I stand by both things I said. I've found that dumping large volumes of code I to the Gemini 2.5 models works extremely well. They also score very highly on the various needle in a haystack benchmarks.

This wasn't true of the earlier Gemini large context models.

And for DRY: sure, maybe it's not quite as easy as "do DRY". My longer answer is that these things are always a conversation: if it outputs code that you don't like, reply and tell it how to fix it.

replies(1): >>timr+5w

>>sfpott+9n
|x| is piece wise continuous, not absolutely continuous

replies(1): >>sfpott+cq2

>>AtlasB+8h
Of course it didn't port sed like that. It doesn't matter that it's open source with tons of examples/test cases. It's not going to go read all the code and change it to a different language. It can pick out what sed's purpose is and it built it for you in the language you asked.

>>deadba+zm
>And some people waste even more time writing stuff from scratch when libraries exist for whatever they’re trying to do

That's an argument for LLMs.

>you’re just using an LLM to acquire more free time on the company dime.

This is a bad thing?

>>simonw+0s
Yeah, I'm aware of the benchmarks. Thomas (author of TFA) is also using Gemini 2.5, and his comments are much closer to what I experience:

> For the last month or so, Gemini 2.5 has been my go-to (because it can hold 50-70kloc in its context window). Almost nothing it spits out for me merges without edits.

I realize this isn't the same thing you're claiming, but it's been consistently true for me that the model hallucinates stuff in my own code, which shouldn't be possible, given the context window and the size of the code I'm giving to it.

(I'm also using it for other, harder problems, unrelated to code, and I can tell you factually that the practical context window is much smaller than 2M tokens. Also, of course, a "token" is not a word -- it's more like 1/3 of a word.)

replies(1): >>simonw+iG

>>tho23j+Oq
Do Cursor and co have better tools than the ones we write ourselves for lower-level interfaces like gptel? Or do they work better because they add post-processing layers that verify the state of the repo after the tool call?

replies(1): >>tho23j+QK

>>nickff+yn
I think they read curve and didn't read continuous.

Which ends up making some beautiful irony. One small seemingly trivial point fucked everything up. Even a single word can drastically change everything. The importance of subtlety being my entire point ¯\_(ツ)_/¯

>>garble+Yl
I like Code Interpreter because I'm deeply familiar with it. I don't have to worry about safety at all because it's running in OpenAI's kubernetes container, not on my own laptop. I can control exactly what it can see by selectively uploading files to it. I know it can't make outbound network requests.

>>timr+5w
That's why I said 1m tokens, not 2m tokens. I don't trust them for 2m tokens yet.

>>karthi+Sx
Cursor is proprietary, but is known to index code for doing queries etc.

Cline is closer in spirit to GPTel, but since CLINE is an actual business, it does seem to do well off the bat. That said, I haven't found it to be "hugely better" compared to whatever you can hack in GPTel.

Quite frankly being able to hack the tools on the go in Elisp, makes GPTel far far better (for some of us anyway).

(Thanks for creating GPTel BTW!)

>>Darmok+Bn
The agent reached out to the internet and pulled the golangci-lint docs. Repeatedly. After generating a v1-compatible config I pointed it to the v2 docs. It tried to do the automatic migration but still wound up with incorrect syntax. I asked it to throw away what it had and build a fresh v2-compatible config. It again consulted the docs. Repeat ad nauseam.

I threw in the towel and had a working config in ten minutes.

>>Yiin+ok
It reached out to the Internet and pulled the docs. Repeatedly. I even linked to the docs directly to be helpful.

>>maleld+kn
That's not how FAANG compensation works.

replies(1): >>maleld+lm2

>>bdangu+1p
intresting. for me it just keeps making up new stuff that doesn't exist when i feed it the error and telling it hallucinates.

perhaps ppl building crud webapps have different experience than ppl building something niche?

replies(1): >>bdangu+B12

>>apwell+aL1
I don't build CRUD apps, 1099 contractor so too expensive to do that kind of work :)

>>skwirl+Ur
You don’t know the harm you’re inflicting. Some manager will read your comment and conclude that anyone who isn’t reducing tasks that previously took hours or days into a brief 1 hour LLM session is underperforming.

In reality, there is a limit to how quickly tasks can be done. Around here, the size of PRs usually have changes that most people could just type out in under 30 minutes if they knew exactly what to type. However, getting to the point where you know exactly what you need to type takes days or even weeks, often collaborating across many teams and thinking deep about potential long term impacts down the road, and balancing company ROI and roadmap objectives, perhaps even running experiments.

You cannot just throw LLMs at those problems and have them wrapped up in an hour. If that’s what you’re doing, you’re not working on big problems, you’re doing basic refactors and small features that don’t require high level skills, where the bottleneck is mostly how fast you can type.

>>astran+eW
FAANG engineers are still working class.

replies(1): >>astran+CN7

>>anxoo+gc
If AI can do anything, why can't I just prompt "Here is sudo access to my laptop, please do all my work for me, respond to emails, manage my household budget, and manage my meetings".

I've tried everything. I have four AI agents. They still have an accuracy rate of about 50%.

>>godels+ss
For a function to be locally linear at a point, it needs to be differentiable at that point... |x| isn't differentiable at 0, so it isn't locally linear at 0... that's the entirety of what I'm saying. :-)

replies(1): >>godels+743

>>ipaddr+Jl
The work just wasn’t done. Or it took enough time for me to go and learn how to do it.

>>sfpott+cq2
You're not wrong. But it has nothing to do with what I said. I think you missed an important word...

Btw, my point was all about how nuances make things hard. So ironically, thanks for making my point clearer.

replies(1): >>sfpott+kl3

>>bdangu+1p
So when LLMs go around in circles, as it often does [1], that's a skill issue. But when it gets it right some of the time, that's proof of superiority.

This is the kind of reasoning that dominates LLM zealotry. No evidence given for extraordinary claims. Just a barrage of dismissals of legitimate problems. Including the article in discussion.

All of this makes me have a hard time taking any of it seriously.

[1]: >>44050152

>>godels+743
Nothing to do with what you said?

  This is true for any curve...

  If your curve is continuous, it is locally linear.

Hmm...

Sometimes naive approximations are all you've got; and in fact, aren't naive at all. They're just basic. Don't overthink it.

>>maleld+lm2
It's hard to be working class on like 4x the median income and stock compensation.

You can't own your own SV home but you can become a slumlord somewhere else remotely.