zlacker

[return to "Scaling long-running autonomous coding"]
1. halfca+mm[view] [source] 2026-01-20 04:05:44
>>srames+(OP)
So AI makes it cheaper to remix anything already-seen, or anything with a stable pattern, if you’re willing to throw enough resources at it.

AI makes it cheap (eventually almost free) to traverse the already-discovered and reach the edge of uncharted territory. If we think of a sphere, where we start at the center, and the surface is the edge of uncharted territory, then AI lets you move instantly to the surface.

If anything already solved becomes cheap to re-instantiate, does R&D reach a point where it can't ever pay off? Why would anyone pay for the long-researched thing when they can get it for free tomorrow? There will be some value in having it today, just as knowing something about a stock today is worth more than learning the same thing tomorrow. But does value itself go away for anything digital, and only remain for anything non-copyable?

The volume of a sphere grows faster than the surface area. But if traversing the interior is instant and frictionless, what does that imply?
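
Concretely, for radius r:

    V = (4/3) π r^3,   A = 4 π r^2,   so   V/A = r/3

The explored interior outgrows the frontier linearly in r, which only matters if crossing the interior actually takes effort.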

◧◩
2. ramraj+jr[view] [source] 2026-01-20 05:07:17
>>halfca+mm
The fundamental idea that modern LLMs can only ever remix, even if it's technically true (which I doubt), only says to me that all knowledge is itself only ever a remix, perhaps even mathematically so. Anyone who still keeps calling these statistical parrots or whatever is just going to regret that position in the future.
◧◩◪
3. heavys+Ss[view] [source] 2026-01-20 05:23:59
>>ramraj+jr
Yeah, Yann LeCun is just some luddite lol
◧◩◪◨
4. Nitpic+Rw[view] [source] 2026-01-20 06:07:33
>>heavys+Ss
I don't think he's a luddite at all. He's brilliant at what he does, but he can also be wrong in his predictions (as all humans are from time to time). He did have 3 main predictions in ~2023-24 that turned out to be wrong in hindsight. It's debatable why they were wrong, but yeah.

In a stage interview (a bit after the "sparks of AGI in GPT-4" paper came out) he made 3 statements:

a) LLMs can't do math. They can trick us with poems and subjective prose, but at objective math they fail.

b) they can't plan

c) by the nature of their autoregressive architecture, errors compound, so a wrong token will make their output irreversibly wrong and spiral out of control.

I think we can safely say that all of these turned out to be wrong. It's very possible that he meant something more abstract and technical at its core, but in real life all of these things were overcome. So, not a luddite, but also not a seer.

◧◩◪◨⬒
5. gjadi+mx[view] [source] 2026-01-20 06:13:27
>>Nitpic+Rw
Have these shortcomings of LLMs been addressed by better models or by better integration with other tools? Like, are they better at coding because the models are truly better, or because the agentic loops are better designed?
◧◩◪◨⬒⬓
6. Nitpic+2z[view] [source] 2026-01-20 06:29:15
>>gjadi+mx
100% by better models. Since his talk, models have gained larger context windows (up to a usable 1M tokens), and RL (reinforcement learning) has been amazing at both picking out good traces and teaching the LLMs how to backtrack and recover from earlier wrong tokens. On top of that, RLAIF (RL with AI feedback) made earlier models better, and RLVR (RL with verifiable rewards) has made them very good at both math and coding.
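
(The "verifiable" part just means the reward comes from a checker you can run, not from a judge model. A toy sketch of what such a reward could look like, purely my own illustration and not any lab's actual training code:)

    import subprocess, tempfile

    def verifiable_reward(candidate_code: str, test_code: str) -> float:
        """Return 1.0 iff the model's code passes the tests, else 0.0."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(candidate_code + "\n\n" + test_code)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, timeout=30)
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if result.returncode == 0 else 0.0

The RL algorithm on top then just makes the traces that earn the 1.0 more likely.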

The harnesses have helped in training the models themselves (i.e. every good trace was "baked into" the model) and have gotten better at enabling test-time compute. But at the end of the day this is all put back into the models, and they become better.

The simplest proof of this is on benchmarks like Terminal-Bench and SWE-bench with simple agents. The current top models are much better than their previous versions when put in a loop with just a "bash tool". There's a ~100 LoC harness called mini-swe-agent [1] that does just that (rough sketch below).

So current models + minimal loop >> previous gen models with human-written harnesses + lots of glue.

> Gemini 3 Pro reaches 74% on SWE-bench verified with mini-swe-agent!

[1] - https://github.com/SWE-agent/mini-swe-agent
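
For a sense of how little harness that actually is, here's a toy version of such a bash-tool loop. This is my own sketch, not mini-swe-agent's actual code; the model name, system prompt and OpenAI-compatible client are all placeholders:

    import subprocess
    from openai import OpenAI   # any OpenAI-compatible client works

    client = OpenAI()
    SYSTEM = "You are a coding agent. Reply with exactly one bash command, or DONE when finished."

    def run_agent(task: str, max_steps: int = 30) -> None:
        messages = [{"role": "system", "content": SYSTEM},
                    {"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = client.chat.completions.create(model="gpt-5", messages=messages)
            cmd = reply.choices[0].message.content.strip()
            if cmd == "DONE":
                break
            # run the command and feed stdout/stderr back as the next observation
            try:
                result = subprocess.run(cmd, shell=True, capture_output=True,
                                        text=True, timeout=120)
                obs = result.stdout + result.stderr
            except subprocess.TimeoutExpired:
                obs = "(command timed out)"
            messages += [{"role": "assistant", "content": cmd},
                         {"role": "user", "content": obs}]

The real project handles details this skips (output truncation, step/cost limits), but the core idea is that the loop stays this simple and the model does the rest.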
