zlacker

[parent] [thread] 4 comments
1. microt+(OP)[view] [source] 2025-05-07 06:20:38
LLMs will still hit a ceiling without human-like reasoning. Even two weeks ago, Claude 3.7 made basic mistakes like trying to convince me that the <= and >= operators on Python sets have the same semantics [1]. Any human would quickly reject something like that (why would two different operators evaluate to the same value?), unless there is overwhelming evidence. Mistakes like this show up all the time, which makes me believe LLMs are still just very good at matching/reproducing code they have seen. Besides that, I've found that LLMs are really bad at novel problems that were not in the training data.
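
For illustration, a quick example (mine, not from the linked screenshot) of how the two operators actually differ: on sets, <= tests "is a subset of" and >= tests "is a superset of".

    a = {1, 2}
    b = {1, 2, 3}

    print(a <= b)  # True:  a is a subset of b
    print(a >= b)  # False: a is not a superset of b
    print(b >= a)  # True:  b is a superset of a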

Also, the reward functions that you mention don't necessarily lead to great code, only to running code. The "should be possible" in the third bullet point does very heavy lifting.

At any rate, I can be convinced that LLMs will lead to substantially reduced teams. There is a lot of junior-level code that I can let an LLM write, and for non-junior-level code you can write/refactor things much faster than by hand, but you need a domain/API/design expert to supervise the LLM. I think in the end it makes programming much more interesting, because you can focus on the interesting problems and less on boilerplate, searching API docs, etc.

[1] https://ibb.co/pvm5DqPh

replies(1): >>jorvi+ph1
2. jorvi+ph1[view] [source] 2025-05-07 16:26:10
>>microt+(OP)
I asked ChatGPT, Claude, Gemini and DeepSeek what AE and OE mean in "Harman AE OE 2018 curve". All of them made up complete bullshit, even for the OE (Over Ear) term; AE is Around Ear. The OE term is absurdly easy to find even with the most basic search skills, and is in fact the fourth result on Google.

The problem with LLMs isn't that they can't do great stuff: it's that you can't trust them to do it consistently. Which means you have to verify what they do, which means you need domain knowledge.

Until the next big evolution in LLMs or a revolution from something else, we'll be alright.

replies(1): >>KoolKa+Il1
3. KoolKa+Il1[view] [source] [discussion] 2025-05-07 16:48:23
>>jorvi+ph1
Both Gemini 2.5 Flash and the small built-in model in Kagi's search got this right on the first try.
replies(1): >>jorvi+Sy1
4. jorvi+Sy1[view] [source] [discussion] 2025-05-07 17:58:43
>>KoolKa+Il1
That is my point, though: Gemini got it wrong for me, which means it is inconsistent.

Say you and I ask Gemini what the perfect internal temperature for a medium-rare steak is. It tells me 72°C, and it tells you 55°C.

Even if it tells 990 people 55°C and only 10 people 72°C, with tens to hundreds of millions of users that is still a gargantuan number of ruined steaks.

replies(1): >>KoolKa+ML1
5. KoolKa+ML1[view] [source] [discussion] 2025-05-07 19:14:05
>>jorvi+Sy1
I know what you're saying; I guess it depends on the use case and the context. It's pretty much like asking someone off the street something random: ask someone about an apple and some may say a computer, others a fruit.

But you're right though.
