zlacker

[parent] [thread] 18 comments
1. xyzzy1+(OP)[view] [source] 2025-05-07 03:10:09
I can't point to any evidence. Also I can't think of what direct evidence I could present that would be convincing, short of an actual demonstration. I would like to try to justify my intuition, though:

Seems like the key question is: should we expect AI programming performance to scale well as more compute and specialised training is thrown at it? I don't see why not; it seems like an almost ideal problem domain:

* Short and direct feedback loops

* Relatively easy to "ground" the LLM by running code

* Self-play / RL should be possible (it seems likely that you could also optimise for aesthetics of solutions based on common human preferences)

* Obvious economic value (based on the multi-billion dollar valuations of vscode forks)

All these things point to programming being "solved" much sooner than, say, chemistry.
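
To make the grounding/RL bullets concrete, here's the kind of loop I have in mind. Just a sketch: the pytest harness and the outer training step are stand-ins, not any real training stack.

    import os
    import subprocess
    import tempfile

    def execution_reward(candidate_code: str, test_code: str, timeout: int = 10) -> float:
        """Ground a generated solution by actually running it against tests.
        Returns 1.0 if the test suite passes, 0.0 otherwise (assumes pytest is installed)."""
        with tempfile.TemporaryDirectory() as tmp:
            with open(os.path.join(tmp, "solution.py"), "w") as f:
                f.write(candidate_code)
            with open(os.path.join(tmp, "test_solution.py"), "w") as f:
                f.write(test_code)
            try:
                result = subprocess.run(
                    ["python", "-m", "pytest", "test_solution.py", "-q"],
                    cwd=tmp, capture_output=True, text=True, timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return 0.0
            return 1.0 if result.returncode == 0 else 0.0

    # Hypothetical outer loop: sample candidates from the model, score them with
    # execution_reward, and feed (prompt, candidate, reward) back into RL training.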

replies(4): >>cap4+b3 >>energy+kb >>microt+8e >>ssalaz+Nw1
2. cap4+b3[view] [source] 2025-05-07 03:49:39
>>xyzzy1+(OP)
This is correct. No idea how people don't see this trend or won't consider it.
3. energy+kb[view] [source] 2025-05-07 05:43:37
>>xyzzy1+(OP)
This is my view. We've seen this before in other problems where there's an automatic verifier on hand. The nature of the problem mirrors previously solved problems.

The LLM skeptics need to point out what differs with code compared to Chess, DoTA, etc from a RL perspective. I don't believe they can. Until they can, I'm going to assume that LLMs will soon be better than any living human at writing good code.

replies(2): >>AnIris+od >>klabb3+BI
◧◩
4. AnIris+od[view] [source] [discussion] 2025-05-07 06:12:15
>>energy+kb
> The LLM skeptics need to point out what differs with code compared to Chess, DoTA, etc from a RL perspective.

An obviously correct automatable objective function? Programming can be generally described as converting a human-defined specification (often very, very rough and loose) into a bunch of precise text files.

Sure, you can use proxies like compilation success / failure and unit tests for RL. But key gaps remain. I'm unaware of any objective function that can grade "do these tests match the intent behind this user request".

Contrast with the automatically verifiable "is a player in checkmate on this board?"

replies(2): >>energy+1g >>Hautho+OK
5. microt+8e[view] [source] 2025-05-07 06:20:38
>>xyzzy1+(OP)
LLMs will still hit a ceiling without human-like reasoning. Even two weeks ago, Claude 3.7 made basic mistakes like trying to convince me that the <= and >= operators on Python sets have the same semantics [1]. Any human would quickly reject something like that (why would two different operators evaluate to the same value?), unless there is overwhelming evidence. Mistakes like this show up all the time, which makes me believe LLMs are still mostly very good at matching/reproducing code they have seen. Besides that, I've found that LLMs are really bad at novel problems that were not in the training data.
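
For the record, the two operators are the subset and superset tests, which a few lines in a REPL confirm:

    a = {1, 2}
    b = {1, 2, 3}

    print(a <= b)  # True: a is a subset of b
    print(a >= b)  # False: a is not a superset of b
    print(b >= a)  # True: b is a superset of a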

Also, the reward functions that you mention don't necessarily lead to great code, only running code. The "should be possible" in the third bullet point does very heavy lifting.

At any rate, I can be convinced that LLMs will lead to substantially reduced teams. There is a lot of junior-level code that I can let an LLM write, and for non-junior-level code you can write/refactor things much faster than by hand, but you need a domain/API/design expert to supervise the LLM. I think in the end it makes programming much more interesting, because you can focus on the interesting problems and less on boilerplate, searching API docs, etc.

[1] https://ibb.co/pvm5DqPh

replies(1): >>jorvi+xv1
◧◩◪
6. energy+1g[view] [source] [discussion] 2025-05-07 06:44:44
>>AnIris+od
I'll hand it to you that only part of the problem is easily represented in automatic verification. It's not easy to design a good reward model for softer things like architectural choices, asking for feedback before starting a project, etc. The LLM will be trained to make the tests pass, and make the code take some inputs and produce desired outputs, and it will do that better than any human, but that is going to be slightly misaligned with what we actually want.

So, it doesn't map cleanly onto previously solved problems, even though there's a decent amount of overlap. But I'd like to add a question to this discussion:

- Can we design clever reward models that punish bad architectural choices, executing on unclear intent, etc? I'm sure there's scope beyond the naive "make code that maps input -> output", even if it requires heuristics or the like.
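
Something like the toy sketch below is the shape I'm imagining (the weights and the lint/complexity inputs are placeholders, not a real reward model):

    def code_reward(tests_passed: int, tests_total: int,
                    lint_warnings: int, max_complexity: int, diff_size: int) -> float:
        """Toy composite reward: correctness first, then soft penalties for
        signals we associate with poor engineering choices."""
        correctness = tests_passed / max(tests_total, 1)

        penalty = 0.0
        penalty += 0.01 * lint_warnings                 # style / smell proxies
        penalty += 0.02 * max(max_complexity - 10, 0)   # over-complex functions
        penalty += 0.001 * max(diff_size - 200, 0)      # discourage sprawling changes

        return correctness - penalty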

replies(1): >>tomato+b71
◧◩
7. klabb3+BI[view] [source] [discussion] 2025-05-07 12:09:45
>>energy+kb
> The LLM skeptics need to point out what differs with code compared to Chess, DoTA, etc from a RL perspective.

I see the burden of proof has been reversed. That’s stage 2 already of the hubris cycle.

On a serious note, these are nothing alike. Games have a clear reward function. With software architecture it is extremely difficult to even agree on basic principles. We regularly invalidate previous “best advice”, and we have many conflicting goals. Tradeoffs are a thing.

Secondly, programming has negative requirements that aren’t verifiable. Security is the perfect example. You don’t make a crypto library with unit tests.

Third, you have the spec problem. What is the correct logic in edge cases? That can be verified but needs to be decided. Also a massive space of subtle decisions.

replies(1): >>zoogen+0L1
◧◩◪
8. Hautho+OK[view] [source] [discussion] 2025-05-07 12:27:52
>>AnIris+od
This is in fact not how a chess engine works. It has an evaluation function that assigns a numerical value (score) based on a number of factors (material advantage, king "safety", pawn structure etc).
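
Roughly like this toy sketch, i.e. a hand-written score over extracted features (the board interface here is invented just for illustration; real engines use far more terms):

    PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

    def evaluate(board) -> float:
        """Toy static evaluation: positive favours White, negative favours Black.
        Assumes `board` exposes piece lists plus mobility and king-safety counts."""
        score = 0.0
        for piece in board.white_pieces:
            score += PIECE_VALUES.get(piece.kind, 0)
        for piece in board.black_pieces:
            score -= PIECE_VALUES.get(piece.kind, 0)
        score += 0.1 * (board.white_mobility - board.black_mobility)          # mobility
        score += 0.2 * (board.white_king_shelter - board.black_king_shelter)  # king safety proxy
        return score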

These heuristics are certainly "good enough" that Stockfish is able to beat the strongest humans, but it's rarely possible for a chess engine to determine if a position results in mate.

I guess the question is whether we can write a good enough objective function that would encapsulate all the relevant attributes of "good code".

replies(2): >>svnt+Hq1 >>AnIris+vl2
◧◩◪◨
9. tomato+b71[view] [source] [discussion] 2025-05-07 14:28:32
>>energy+1g
the promo process :P no noise there!
◧◩◪◨
10. svnt+Hq1[view] [source] [discussion] 2025-05-07 15:59:19
>>Hautho+OK
Maybe I am misunderstanding what you are saying, but e.g. Stockfish, given time and threads, seems very good at finding forced checkmates 20 or more moves deep.
◧◩
11. jorvi+xv1[view] [source] [discussion] 2025-05-07 16:26:10
>>microt+8e
I asked ChatGPT, Claude, Gemini and DeepSeek what AE and OE mean in "Harman AE OE 2018 curve". All of them made up complete bullshit, even for the OE (Over Ear) term. AE is Around Ear. The OE term is absurdly easy to find even with the most basic search skills, and is in fact the fourth result on Google.

The problem with LLMs isn't that they can't do great stuff: it's that you can't trust them to do it consistently. Which means you have to verify what they do, which means you need domain knowledge.

Until the next big evolution in LLMs or a revolution from something else, we'll be alright.

replies(1): >>KoolKa+Qz1
12. ssalaz+Nw1[view] [source] 2025-05-07 16:33:18
>>xyzzy1+(OP)
Thanks -- this is much more thoughtful than the persistent chorus of "just trust me, bro".
◧◩◪
13. KoolKa+Qz1[view] [source] [discussion] 2025-05-07 16:48:23
>>jorvi+xv1
Both Gemini 2.5 Flash and Kagi's small built-in search model got this right on the first try.
replies(1): >>jorvi+0N1
◧◩◪
14. zoogen+0L1[view] [source] [discussion] 2025-05-07 17:47:02
>>klabb3+BI
> I see the burden of proof has been reversed.

Isn't this just a pot calling the kettle black? I'm not sure why either side has the rightful position of "my opinion is right until you prove otherwise".

We're talking about predictions for the future; anyone claiming to be "right" is lacking humility. The only thing going on is people justifying their opinions, and no one can offer "proof".

replies(1): >>klabb3+EJ2
◧◩◪◨
15. jorvi+0N1[view] [source] [discussion] 2025-05-07 17:58:43
>>KoolKa+Qz1
That is my point though. Gemini got it wrong for me. Which means it is inconsistent.

Say you and I ask Gemini what the perfect internal temperature for a medium-rare steak is. It tells me 72c, and it tells you 55c.

Even if it tells 990 people 55c and only 10 people 72c, with tens to hundreds of millions of users that is still a gargantuan number of ruined steaks.

replies(1): >>KoolKa+UZ1
◧◩◪◨⬒
16. KoolKa+UZ1[view] [source] [discussion] 2025-05-07 19:14:05
>>jorvi+0N1
I know what you're saying. I guess it depends on the use case and the context. It's pretty much like asking someone off the street something random: ask someone about an apple and some may say a computer, others a fruit.

But you're right though.

◧◩◪◨
17. AnIris+vl2[view] [source] [discussion] 2025-05-07 21:50:50
>>Hautho+OK
An automated objective function is indeed core to how AlphaGo, AlphaZero, and other RL + deep learning approaches work, though it is obviously much more complex and integrated into a larger system.

The core of these approaches is "self-play", which is where the "superhuman" qualities arise. The system plays billions of games against itself and uses the data from those games to further refine itself. It seems that an automated "referee" (objective function) is an inescapable requirement for unsupervised self-play.
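
Schematically, the whole loop hinges on that referee. The function names below are made up; this isn't any particular framework, just the shape of the thing:

    def self_play_iteration(policy, referee, num_games=1000):
        """One schematic self-play step: the policy plays itself, the automated
        referee scores the finished games, and the outcomes become training data."""
        training_data = []
        for _ in range(num_games):
            states, moves = play_game(policy, policy)   # policy vs. itself
            outcome = referee(states[-1])               # objective: win / loss / draw
            training_data.append((states, moves, outcome))
        return update_policy(policy, training_data)     # e.g. a policy/value update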

I would suggest that Stockfish and other older chess engines are not a good analogy for this discussion. Worth noting, though, that even Stockfish no longer uses a hand-written objective function on extracted features like you describe. It instead uses a highly optimized neural network trained on millions of positions from human games.

◧◩◪◨
18. klabb3+EJ2[view] [source] [discussion] 2025-05-08 01:59:48
>>zoogen+0L1
> Isn't this just a pot calling the kettle black?

New expression to me, thanks.

But yes, and no. I’d agree in the sense that the null hypothesis is crucial, possibly the main divider between optimists and pessimists. But I’ll still hold firm that the baseline should be predicting that transformer-based AI differs from humans in ability, since everything from neural architecture to training and inference works differently. But most importantly, existing AI varies dramatically in ability across domains, exceeding human ability in some and failing miserably in others.

Another way to interpret the advancement of AI is viewing it as a mirror directed at our neurophysiology. Clearly, lots of things we thought were different, like pattern matching in audio- or visual spaces, are more similar than we thought. Other things, like novel discoveries and reasoning, appear to require different processes altogether (or otherwise, we’d see similar strength in those, given that training data is full of them).

replies(1): >>mrkstu+mG3
◧◩◪◨⬒
19. mrkstu+mG3[view] [source] [discussion] 2025-05-08 13:43:24
>>klabb3+EJ2
I think the difference is that computers tend to be pretty good at things we can do autonomically - ride a bike, drive a car in non-novel/non-dangerous situations - and at things that are advanced versions of unreasoned speech: regurgitations/reformulations of things they can gather from a large corpus and cast into their neural net.

They fail at things requiring novel reasoning not already extant in their corpus, a sense of self, or an actual ability to continuously learn from experience, though those things can be programmed in manually as secondary, shallow characteristics.

[go to top]