Seems like the key question is: should we expect AI programming performance to scale well as more compute and specialised training is thrown at it? I don't see why not, it seems an almost ideal problem domain?
* Short and direct feedback loops
* Relatively easy to "ground" the LLM by running code
* Self-play / RL should be possible (it seems likely that you could also optimise for aesthetics of solutions based on common human preferences)
* Obvious economic value (based on the multi-billion dollar valuations of vscode forks)
All these things point to programming being "solved" much sooner than say, chemistry.
The LLM skeptics need to point out how code differs from Chess, Dota, etc. from an RL perspective. I don't believe they can. Until they do, I'm going to assume that LLMs will soon be better than any living human at writing good code.
An obviously correct automatable objective function? Programming can be generally described as converting a human-defined specification (often very, very rough and loose) into a bunch of precise text files.
Sure, you can use proxies like compilation success / failure and unit tests for RL. But key gaps remain. I'm unaware of any objective function that can grade "do these tests match the intent behind this user request".
Contrast with the automatically verifiable "is a player in checkmate on this board?"
Also, the reward functions you mention don't necessarily lead to great code, only to running code. The "should be possible" in the third bullet point is doing very heavy lifting.
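To make the gap concrete, here's a minimal sketch of the kind of proxy reward being discussed: run a candidate against unit tests and score by pass rate. All names and the structure are hypothetical, not from any real RL pipeline; the point is what such a function can and cannot see.

```python
def proxy_reward(candidate_src: str, test_cases: list) -> float:
    """Score candidate code by the fraction of I/O test cases it passes.

    Note what this does NOT measure: whether the tests capture the
    user's intent, or anything about quality, security, architecture.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # "does it compile/run" proxy
    except Exception:
        return 0.0
    fn = namespace.get("solve")  # hypothetical expected entry point
    if fn is None:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors simply score as failures
    return passed / len(test_cases)

tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
good = "def solve(a, b):\n    return a + b"
bad = "def solve(a, b):\n    return a - b"
print(proxy_reward(good, tests))  # → 1.0
print(proxy_reward(bad, tests))   # partial credit for coincidental passes
```

A candidate that games the tests scores identically to one that matches the user's intent, which is exactly the objection above.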
At any rate, I can be convinced that LLMs will lead to substantially smaller teams. There is a lot of junior-level code I can let an LLM write, and for non-junior-level code you can write/refactor things much faster than by hand, but you need a domain/API/design expert to supervise the LLM. I think in the end it makes programming much more interesting, because you can focus on the interesting problems and less on boilerplate, searching API docs, etc.
So, it doesn't map cleanly onto previously solved problems, even though there's a decent amount of overlap. But I'd like to add a question to this discussion:
- Can we design clever reward models that punish bad architectural choices, executing on unclear intent, etc? I'm sure there's scope beyond the naive "make code that maps input -> output", even if it requires heuristics or the like.
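One cheap direction for the question above: combine the functional pass rate with heuristic penalties standing in for "architectural" quality. The heuristics below (line count, nesting depth) and the weights are invented purely for illustration; real quality is far harder to proxy, which is the thread's point.

```python
def style_penalty(src: str) -> float:
    """Crude proxies for bad structure: long code and deep nesting
    cost reward. Assumes 4-space indentation."""
    lines = [line for line in src.splitlines() if line.strip()]
    depth = max((len(line) - len(line.lstrip())) // 4 for line in lines)
    return 0.005 * len(lines) + 0.05 * depth

def composite_reward(pass_rate: float, src: str) -> float:
    """Reward = correctness proxy minus style penalty, floored at 0."""
    return max(0.0, pass_rate - style_penalty(src))

flat = "def solve(x):\n    return x * 2"
nested = ("def solve(x):\n    if x:\n        if x > 0:\n"
          "            return x * 2\n    return 0")
print(composite_reward(1.0, flat) > composite_reward(1.0, nested))  # → True
```

This rewards the flatter solution over the needlessly nested one, but it's trivially gameable and says nothing about intent; it only shows there is scope beyond "input -> output".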
I see the burden of proof has been reversed. That’s stage 2 already of the hubris cycle.
On a serious note, these are nothing alike. Games have a clear reward function. In software architecture it is extremely difficult to even agree on basic principles. We regularly invalidate previous "best advice", and we have many conflicting goals. Tradeoffs are a thing.
Secondly, programming has negative requirements that aren't verifiable. Security is the perfect example: you don't make a crypto library secure with unit tests.
Third, you have the spec problem. What is the correct logic in edge cases? That can be verified, but it first has to be decided. It's also a massive space of subtle decisions.
These heuristics are certainly "good enough" that Stockfish is able to beat the strongest humans, but it's rarely possible for a chess engine to determine if a position results in mate.
I guess the question is whether we can write a good enough objective function that would encapsulate all the relevant attributes of "good code".
The problem with LLMs isn't that they can't do great stuff: it's that you can't trust them to do it consistently. Which means you have to verify what they do, which means you need domain knowledge.
Until the next big evolution in LLMs or a revolution from something else, we'll be alright.
Isn't this just the pot calling the kettle black? I'm not sure why either side gets the rightful position of "my opinion is right until you prove otherwise".
We're talking about predictions for the future; anyone claiming to be "right" is lacking humility. The only thing going on is people justifying their opinions, and no one can offer "proof".
Say you and I ask Gemini what the perfect internal temperature for a medium-rare steak is. It tells me 72°C, and it tells you 55°C.
Even if it tells 990 people 55°C and only 10 people 72°C, with tens to hundreds of millions of users that is still a gargantuan number of ruined steaks.
But you're right though.
The core of these approaches is "self-play", which is where the "superhuman" qualities arise. The system plays billions of games against itself and uses the data from those games to refine itself further. It seems that an automated "referee" (objective function) is an inescapable requirement for unsupervised self-play.
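The "automated referee" is easy to illustrate with a toy game. Below is a hypothetical self-play sketch for Nim (take 1-3 stones; whoever takes the last stone wins): the referee is just the game's rules, with no human judgment needed. In a real system the win/loss outcomes would feed a policy update; chess's referee is the automatically checkable "is this checkmate", which is what the thread contrasts with code.

```python
import random

def referee(stones: int) -> bool:
    """The automated referee: the game is over when no stones remain."""
    return stones == 0

def random_policy(stones: int) -> int:
    """Stand-in for a learned policy: take 1-3 stones at random."""
    return random.randint(1, min(3, stones))

def self_play(stones: int = 10) -> int:
    """Play one game of the policy against itself; return the winner (0 or 1)."""
    player = 0
    while not referee(stones):
        stones -= random_policy(stones)
        player = 1 - player  # hand the turn over
    # The player who just moved took the last stone and won.
    return 1 - player

# Each game yields an unambiguous outcome to learn from, for free.
print(self_play(1))  # → 0 (player 0 must take the only stone and wins)
```

The contrast with code generation: there is no equivalent referee for "this program matches the user's intent", so the outcomes you can collect billions of are only proxies.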
I would suggest that Stockfish and other older chess engines are not a good analogy for this discussion. Worth noting, though, that even Stockfish no longer uses a hand-written objective function on extracted features like you describe. It instead uses a highly optimized neural network trained on millions of positions from human games.
New expression to me, thanks.
But yes, and no. I'd agree in the sense that the null hypothesis is crucial, possibly the main divider between optimists and pessimists. But I'll still hold firm that the baseline should be to predict that transformer-based AI differs from humans in ability, since everything from neural architecture to training and inference works differently. Most importantly, existing AI varies dramatically in ability across domains: it exceeds human ability in some and fails miserably in others.
Another way to interpret the advancement of AI is viewing it as a mirror directed at our neurophysiology. Clearly, lots of things we thought were different, like pattern matching in audio- or visual spaces, are more similar than we thought. Other things, like novel discoveries and reasoning, appear to require different processes altogether (or otherwise, we’d see similar strength in those, given that training data is full of them).
They fail at things requiring novel reasoning not already present in their corpus, a sense of self, or an actual ability to continuously learn from experience, though those things can be programmed in manually as secondary, shallow characteristics.