zlacker

[return to "My AI skeptic friends are all nuts"]
1. jszymb+JM[view] [source] 2025-06-03 03:48:33
>>tablet+(OP)
The argument that I've heard against LLMs for code is that they create bugs that, by design, are very difficult to spot.

The LLM has one job: to make code that looks plausible. That's it. No logic has gone into writing that bit of code. So the bugs often won't be like the ones a programmer makes. Instead, they can introduce a whole new class of bug that's way harder to debug.

◧◩
2. mindwo+TN[view] [source] 2025-06-03 04:05:29
>>jszymb+JM
This is a misunderstanding. Modern LLMs are trained with RL to actually write good programs. They aren't just spewing tokens out.
◧◩◪
3. godels+0S[view] [source] 2025-06-03 04:50:30
>>mindwo+TN
No, YOU misunderstand. This isn't something RL can fix:

  https://news.ycombinator.com/item?id=44163194

  https://news.ycombinator.com/item?id=44068943
It doesn't optimize "good programs". It optimizes humans' interpretation of good programs. More accurately, it optimizes what low-paid, overworked humans believe are good programs. Are you hiring your best and brightest to code review the LLMs?

Even if you do, it still optimizes for tricking them. It will also optimize for writing good programs, but you act like that's a well-defined and measurable thing.

◧◩◪◨
4. tptace+471[view] [source] 2025-06-03 07:26:34
>>godels+0S
I don't know if any of this applies to the arguments in my article, but most of the point of it is that progress in code production from LLMs is not a consequence of better models (or fine-tuning or whatever), but rather of a shift in how LLMs are used: in agent loops with access to ground truth about whether things compile and pass automated acceptance checks. And I'm not claiming that closed-loop agents reliably produce mergeable code, only that they've broken through a threshold where they produce enough mergeable code that they significantly accelerate development.
◧◩◪◨⬒
5. godels+Tj1[view] [source] 2025-06-03 09:43:17
>>tptace+471

  > I don't know if any of this applies to the arguments

  > with access to ground truth
There's the connection. You think you have ground truth. No such thing exists.
◧◩◪◨⬒⬓
6. tptace+JF2[view] [source] 2025-06-03 18:54:20
>>godels+Tj1
It's even simpler than what 'rfrey said. You're here using "ground truth" in some kind of grand epistemic sense, and I simply mean "whether the exit code from a program was 1 or 0".

You can talk about how meaningful those exit codes and error messages are or aren't, but the point is that they are profoundly different from the information an LLM natively operates with, which is atomized weights predicting next tokens based on what an abstract notion of a correct line of code or an error message might look like. An LLM can (and will) lie to itself about what it is perceiving. An agent cannot; it's just 200 lines of Python, it literally can't.
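
To make that concrete, here's a rough sketch of the verification side of such a loop (simplified, not any particular harness; "generate_patch" and "apply_patch" are placeholder names for the model call and the workspace plumbing):

  import subprocess

  def run_and_report(cmd):
      # Run the build/test command. The exit code is "ground truth" in the
      # narrow sense I mean: it comes from the toolchain, not from the
      # model's own tokens.
      result = subprocess.run(cmd, capture_output=True, text=True)
      return result.returncode == 0, result.stdout + result.stderr

  def agent_loop(generate_patch, apply_patch, test_cmd, max_iters=5):
      # Closed loop: the model proposes, the harness verifies.
      # generate_patch/apply_patch are placeholders for the LLM call and
      # the workspace plumbing.
      feedback = ""
      for _ in range(max_iters):
          patch = generate_patch(feedback)       # LLM proposes a change
          apply_patch(patch)                     # write it into the tree
          passed, feedback = run_and_report(test_cmd)
          if passed:                             # compiler/tests said yes
              return True
      return False

None of that code can be talked out of what the exit code actually was.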

◧◩◪◨⬒⬓⬔
7. godels+Ec3[view] [source] 2025-06-03 22:16:02
>>tptace+JF2

  > You're here using "ground truth" in some kind of grand epistemic sense
I used the term "ground truth" because you did!

  >> in agent loops with access to ground truth about whether things compile and pass automatic acceptance.
Your critique of "my usage of ground truth" is the same critique I'm giving you about it! You really are doing a good job of making me feel like I'm going nuts...

  > the information an LLM natively operates with,
And do you actually know what this is?

I am an ML researcher, you know. And one of the ones who keeps saying "you should learn the math." There's a reason for that: it's really connected to what you're talking about here. These models are opaque, but they sure aren't black boxes.

And it really sounds like you think the "thinking" tokens are remotely representative of the internal processing. You're a daily HN user; I'm pretty sure you saw this one[0].

I'm not saying anything OpenAI hasn't said[1]. I just recognize that this applies to more than a very specific narrow case...

[0] >>44074111

[1] https://cdn.openai.com/pdf/34f2ada6-870f-4c26-9790-fd8def563...

◧◩◪◨⬒⬓⬔⧯
8. tptace+ke3[view] [source] 2025-06-03 22:30:47
>>godels+Ec3
Right, I'm just saying, I meant something else by the term than you did. Again: my point is, the math of the LLM doesn't matter to the point I'm making. It's not the model figuring out whether the code actually compiled. It's 200 lines of almost straight-line Python code that has cracked the elusive computer science problem of running an executable and checking the exit code.