zlacker

Scaling long-running autonomous coding

submitted by srames+(OP) on 2026-01-20 00:23:01 | 182 points 109 comments
[view article] [source] [go to bottom]

Related: Scaling long-running autonomous coding - https://news.ycombinator.com/item?id=46624541 - Jan 2026 (187 comments)


NOTE: showing posts with links only [show all posts]
2. simonw+vf[view] [source] 2026-01-20 02:56:02
>>srames+(OP)
One of the big open questions for me right now concerns how library dependencies are used.

Most of the big ones are things like skia, harfbuzz, wgpu - all totally reasonable IMO.

The two that stand out for me as more notable are html5ever for parsing HTML and taffy for handling CSS grids and flexbox - that's vendored with an explanation of some minor changes here: https://github.com/wilsonzlin/fastrender/blob/19bf1036105d4e...

Taffy is a solid library choice, but it's probably the most robust ammunition for anyone who wants to argue that this shouldn't count as a "from scratch" rendering engine.

I don't think it detracts much if at all from FastRender as an example of what an army of coding agents can help a single engineer achieve in a few weeks of work.

◧◩
3. sealec+4g[view] [source] [discussion] 2026-01-20 03:01:55
>>simonw+vf
I think the other question is how far away this is from a "working" browser. It isn't impossible to render a meaningful subset of HTML (especially when you use external libraries to handle a lot of this). The real difficulty is doing this (a) quickly, (b) correctly and (c) securely. All of those are very hard problems, and also quite tricky to verify.

I think this kind of approach is interesting, but it's a bit sad that Cursor didn't discuss how they close the feedback loop: testing/verification. As generating code becomes cheaper, I think effort will shift to how we can more cheaply and reliably determine whether an arbitrary piece of code meets a desired specification. For example did they use https://web-platform-tests.org/, fuzz testing (e.g. feed in random webpages and inform the LLM when the fuzzer finds crashes), etc? I would imagine truly scaling long-running autonomous coding would have an emphasis on this.
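
Roughly the kind of loop I have in mind (a toy sketch in Rust; render_html is a hypothetical stand-in for the engine's entry point, not anything from FastRender):

  use std::panic;

  fn render_html(html: &str) {
      // Hypothetical stand-in for the engine's render entry point.
      if html.contains("<table") && !html.contains("</table>") {
          panic!("layout invariant violated");
      }
  }

  fn random_html(seed: u64) -> String {
      // Toy generator; a real fuzzer would mutate a corpus of real pages.
      let snippets = ["<div>x</div>", "<span>x</span>", "<table><tr>", "<p>x</p>"];
      format!("<html><body>{}</body></html>", snippets[(seed % 4) as usize])
  }

  fn main() {
      for seed in 0..1_000u64 {
          let input = random_html(seed);
          if panic::catch_unwind(|| render_html(&input)).is_err() {
              // A real harness would hand the crashing input (plus a backtrace)
              // back to the coding agent as a bug report to fix.
              eprintln!("crash on input: {input}");
          }
      }
  }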

Of course Cursor may well have done this, but it wasn't super deeply discussed in their blog post.

I really enjoy reading your blog and it would be super cool to see you look at approaches people have to ensuring that LLM-produced code is reliable/correct.

◧◩◪◨⬒⬓
26. Nitpic+2z[view] [source] [discussion] 2026-01-20 06:29:15
>>gjadi+mx
100% by better models. Since his talk, models have gained larger context windows (up to a usable 1M tokens), and RL (reinforcement learning) has been great both at picking out good traces and at teaching LLMs how to backtrack and recover from earlier wrong tokens. On top of that, RLAIF (RL with AI feedback) made earlier models better and RLVR (RL with verifiable rewards) has made them very good at both math and coding.

The harnesses have helped in training the models themselves (i.e. every good trace was "baked into" the model) and have improved at enabling test-time compute. But at the end of the day this is all put back into the models, and they become better.

The simplest proof of this is on benchmarks like Terminal-Bench and SWE-bench with simple agents. The current top models are much better than their previous versions when put in a loop with just a "bash tool". There's a ~100 LoC harness called mini-swe-agent [1] that does just that.

So current models + minimal loop >> previous gen models with human written harnesses + lots of glue.

> Gemini 3 Pro reaches 74% on SWE-bench verified with mini-swe-agent!

[1] - https://github.com/SWE-agent/mini-swe-agent
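
For a sense of how small that harness is, the core of the loop is roughly this (a sketch, not the actual mini-swe-agent code, which is Python; call_model is a placeholder for whatever model API you use):

  use std::process::Command;

  // Placeholder for the LLM call: takes the transcript so far and returns the
  // model's next reply, expected to be either a shell command or "DONE".
  fn call_model(transcript: &str) -> String {
      let _ = transcript;
      unimplemented!("plug in a real model API here")
  }

  fn main() {
      let mut transcript = String::from("Task: make the failing test pass.\n");
      loop {
          let reply = call_model(&transcript);
          if reply.trim() == "DONE" {
              break;
          }
          // The single "bash tool": run the model's reply as a shell command.
          let output = Command::new("bash")
              .arg("-lc")
              .arg(&reply)
              .output()
              .expect("failed to run bash");
          // Feed the command and its output back into the context and repeat.
          transcript.push_str(&format!(
              "\n$ {}\n{}{}",
              reply,
              String::from_utf8_lossy(&output.stdout),
              String::from_utf8_lossy(&output.stderr),
          ));
      }
  }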

◧◩
33. embedd+WY[view] [source] [discussion] 2026-01-20 10:17:09
>>simonw+vf
For me, the biggest open question is currently "How autonomous is 'autonomous'?", because the commits make it clear there were multiple actors contributing to the repository, and the timing/merges make it seem like a human might have been involved in choosing what to merge (hard to know 100%) and in making smaller commits of their own. I'm really curious what Cursor's claim that "It ran uninterrupted for one week" actually means.

I've reached out to the engineer who seems to have run the experiment and can hopefully shed some more light on it; my update to >>46646777 will include the replies and further investigation.

◧◩◪
47. simonw+Ri1[view] [source] [discussion] 2026-01-20 13:07:57
>>teaear+Fs
No, it has its own JS implementation: https://github.com/wilsonzlin/fastrender/tree/main/vendor/ec...

See also: >>46650998

◧◩◪
51. clbrmb+fl1[view] [source] [discussion] 2026-01-20 13:24:18
>>Gazoch+u11
Life is more fun as a scruffie.

[0] http://www.catb.org/~esr/jargon/html/N/neats-vs--scruffies.h...

◧◩◪◨⬒
56. encycl+mq1[view] [source] [discussion] 2026-01-20 13:59:23
>>Zababa+Bo1
We know what an LLM is; in fact, you can build one from scratch if you like, e.g. https://www.manning.com/books/build-a-large-language-model-f...

It's an algorithm and a completely mechanical process which you can quite literally copy time and time again. Unless of course you think 'physical' computers have magical powers that a pen and paper Turing machine doesn't?

> Many people are throwing around that they don't "think", that they aren't "conscious", that they don't "reason", but I don't see those people sharing interesting heuristics to use LLMs well.

My digital thermometer doesn't think. Imbuing LLMs with thought will start leading to some absurd conclusions.

A cursory read of basic philosophy would help elucidate why casually saying LLMs think, reason, etc. is not good enough.

What is thinking? What is intelligence? What is consciousness? These questions are difficult to answer. There is NO clear definition. Some things are so hard to define (and people have tried for centuries) that they are a problem set unto themselves; consciousness is one of them, see the hard problem of consciousness.

https://en.wikipedia.org/wiki/Hard_problem_of_consciousness

◧◩◪
58. omnico+Er1[view] [source] [discussion] 2026-01-20 14:09:52
>>ramraj+jr
Why doubt? Transformers are a form of kernel smoothing [1]. It's literally interpolation [2]. That doesn't mean it can only echo the exact items in its training data - generating new data items is the entire point of interpolation - but it does mean it's "remixing" (literally forming a weighted sum of) those items and we would expect it to lose fidelity when moving outside the area covered by those points - i.e. where it attempts to extrapolate. And indeed we do see that, and for some reason we call it "hallucinating".
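
For concreteness, softmax attention has exactly the Nadaraya-Watson kernel-smoothing form (this is the point of [1]):

  \mathrm{Attn}(q) = \sum_i \frac{K(q, k_i)}{\sum_j K(q, k_j)} \, v_i,
  \qquad K(q, k) = \exp\!\left( q \cdot k / \sqrt{d} \right)

so the output is always a convex combination of the value vectors v_i: it interpolates well inside the region those values cover and degrades once you push outside it.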

The subsequent argument that "LLMs only remix" => "all knowledge is a remix" seems absurd, and I'm surprised to have seen it now more than once here. Humanity didn't get from discovering fire to launching the JWST solely by remixing existing knowledge.

[1] http://bactra.org/notebooks/nn-attention-and-transformers.ht...

[2] Well, smoothing/estimation but the difference doesn't matter for my point.

◧◩◪
62. Chipsh+wy1[view] [source] [discussion] 2026-01-20 14:53:51
>>NiloCK+Yk1
Yeah, I admit I'm probably not doing that quite optimally. I'm still just letting the LLM generate ephemeral .md files that I delete after a certain task is done.

The other day I found [beads](https://github.com/steveyegge/beads) and thought maybe that could be a good improvement over my current state.

But I'm quite hesitant, because I've also seen these AGENTS.md files become stale, and then there's the question of how much information is too much, especially with limited context windows.

Probably all things that could again just be solved by leveraging AI more and I'm just an LLM noob. :D

72. polygl+S92[view] [source] 2026-01-20 17:21:53
>>srames+(OP)
I'm a maintainer of Servo which is another web engine project.

Although I dissented on the decision, we banned the use of AI. Outside of the project I've been enjoying agentic coding and I do think it can be used already today to build production-grade software of browser-like complexity.

But this project shows that autonomous agents without human oversight are not the way forward.

Why? Because the generated code makes little sense from a conceptual perspective and does not provide a foundation on which to eventually build an entire web engine.

For example, I've just looked into the IndexedDB implementation, which happens to be what I am working on at the moment in Servo.

Now, my work in Servo is incomplete, but conceptually the code that is in place makes sense and there is a clear path towards eventually implementing the thing as a whole.

In Fastrender, you see an Arc<Mutex<Database>>, which is never going to work, because by definition a production browser engine will have to involve multiple processes. That doesn't mean you need IPC in a prototype, but you certainly should not have shared state; some simple message passing between threads or tasks would do.
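
A minimal sketch of the shape I mean (my own illustration, not FastRender or Servo code): the database is owned by a single thread and everything else talks to it over a channel, which is exactly the seam where a real process boundary would later go.

  use std::collections::HashMap;
  use std::sync::mpsc;
  use std::thread;

  enum DbRequest {
      Put { key: String, value: String },
      Get { key: String, reply: mpsc::Sender<Option<String>> },
  }

  fn main() {
      let (db_tx, db_rx) = mpsc::channel::<DbRequest>();

      // The "database side": sole owner of the state, no locks anywhere.
      thread::spawn(move || {
          let mut store: HashMap<String, String> = HashMap::new();
          for req in db_rx {
              match req {
                  DbRequest::Put { key, value } => {
                      store.insert(key, value);
                  }
                  DbRequest::Get { key, reply } => {
                      let _ = reply.send(store.get(&key).cloned());
                  }
              }
          }
      });

      // The "content side" sends messages instead of locking an Arc<Mutex<_>>.
      db_tx.send(DbRequest::Put { key: "k".into(), value: "v".into() }).unwrap();
      let (reply_tx, reply_rx) = mpsc::channel();
      db_tx.send(DbRequest::Get { key: "k".into(), reply: reply_tx }).unwrap();
      assert_eq!(reply_rx.recv().unwrap(), Some("v".to_string()));
  }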

The above is an easy coding fix for the AI, but it requires input from a human with a pretty good idea of what the architecture should look like.

For comparison, when I look at the code in Ladybird, yet another browser project, I can immediately find my way around what is, for me, an unfamiliar codebase: not just a single file but large swaths of the project, and I can understand things like how their rendering loop works. With Fastrender I find it hard to find my way around, despite all the architectural diagrams in the README.

So what do I propose instead of long-running autonomous agents? The focus should shift towards demonstrating how AI can effectively assist humans in building well-architected software. The AI is great at coding, but you eventually run into what I call conceptual bottlenecks, which can be overcome with human oversight. I've written about this elsewhere: https://medium.com/@polyglot_factotum/on-writing-with-ai-87c...

There is one very good idea in the project: adding the web standards directly in the repo so it can be used as context by the AI and humans alike. Any project can apply this by adding specs and other artifacts right next to the code. I've been doing this myself with TLA+, see https://medium.com/@polyglot_factotum/tla-in-support-of-ai-c...

To further ground the AI code output, I suggest telling it to document the code with the corresponding lines from the spec.
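
Something like this, to illustrate (made up by me, paraphrasing the event loop processing model rather than quoting it; not FastRender code):

  use std::collections::VecDeque;

  type Task = Box<dyn FnOnce()>;

  struct EventLoop {
      tasks: VecDeque<Task>,
      microtasks: VecDeque<Task>,
  }

  impl EventLoop {
      // Spec (paraphrased): while the microtask queue is not empty, dequeue
      // the oldest microtask and run it.
      fn perform_microtask_checkpoint(&mut self) {
          while let Some(microtask) = self.microtasks.pop_front() {
              microtask();
          }
      }

      // Spec (paraphrased): let oldestTask be the first runnable task in the
      // chosen task queue, remove it, run its steps, then perform a microtask
      // checkpoint.
      fn run_one_iteration(&mut self) {
          if let Some(task) = self.tasks.pop_front() {
              task();
          }
          self.perform_microtask_checkpoint();
      }
  }

  fn main() {
      let mut event_loop = EventLoop {
          tasks: VecDeque::new(),
          microtasks: VecDeque::new(),
      };
      event_loop.tasks.push_back(Box::new(|| println!("task")));
      event_loop.microtasks.push_back(Box::new(|| println!("microtask")));
      event_loop.run_one_iteration();
  }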

Back in early 2025 when we had those discussions in Servo about whether to allow some use of AI, I wrote this guide https://gist.github.com/gterzian/26d07e24d7fc59f5c713ecff35d... which I think is also the kind of context you want to give the AI. Note that this was back in the days of accepting edits with tabs...

◧◩◪◨
85. fulafe+pM2[view] [source] [discussion] 2026-01-20 20:09:01
>>cheevl+dG
There are AI-based 3D asset generation tools around. For example https://www.meshy.ai/ https://hyper3d.ai/ https://www.sloyd.ai/
96. gforce+kJ4[view] [source] 2026-01-21 12:17:15
>>srames+(OP)
Wow, for screenshots it's much faster than Chromium:

  $ time target/release/fetch_and_render "https://www.lauf-goethe-lauf.de/"
  real 0m0,685s
  user 0m0,548s
  sys 0m0,070s
  
  $ time chromium --headless --disable-gpu --screenshot=out.png --window-size=1200,800 https://www.lauf-goethe-lauf.de/
  real 0m1,099s
  user 0m0,927s
  sys 0m0,692s
# edit: with a hot-standby Chrome and a running Node instance I can reach 0,369s here
◧◩◪
100. polygl+NZ4[view] [source] [discussion] 2026-01-21 13:44:45
>>sealec+4g
I think the current approach simply will never scale to a working browser.

To leverage AI to build a working browser you would imo need the following:

- A team of humans with some good ideas on how to improve on existing web engines.

- A clear architectural story written not by agents but by humans. Architecture does not mean high-level diagrams only. At each level of abstraction, you need humans to decide what makes sense and only use the agent to bang out slight variations.

- A modular and human-overseen agentic loop approach: one agent can keep running to try to fix a specific CSS feature (like grid), with a human expert reviewing the work at some interval (not sure how fine-grained it should be). This is actually very similar to running an open-source project: you have code owners and a modular review process, not just an army of contributors committing whatever they want. And a "judge agent" is not the same thing as a human code owner as reviewer.

Example on how not to do it: https://github.com/wilsonzlin/fastrender/blob/19bf1036105d4e...

This rendering loop architecture makes zero sense, and it does not implement web standards.

> in the HTML Standard, requestAnimationFrame is part of the frame rendering steps (“update the rendering”), which occur after running a task and performing a microtask checkpoint

> requestAnimationFrame callbacks run on the frame schedule, not as normal tasks.

This is BS: "update the rendering" is specified as just another task, which means it needs to be followed by a microtask checkpoint. See https://html.spec.whatwg.org/multipage/#event-loop-processin...

Following the spec doesn't mean you cannot optimize rendering tasks in some way vs other tasks in your implementation, but the above is not that; it's classic AI BS.

Understanding Web standards and translating them into an implementation requires human judgement.

Don't use an agent to draft your architecture; an expert in web standards with an interest in agentic coding is what is required.

Message to Cursor CEO: next time, instead of lighting up those millions on fire, reach out to me first: https://github.com/gterzian

◧◩◪◨⬒
106. polygl+Fwe[view] [source] [discussion] 2026-01-24 09:46:15
>>ontouc+ef8
I'm not sure what you mean by your first sentence in terms of product.

But in general, my guess at an answer (supported by the results of the experiment discussed in this thread) is that:

- GenAI left unsupervised cannot write a browser/engine, or any other complex software. What you end up with is just chaos.

- A group of humans using GenAI and supervising its output could write such an engine (or any other complex software), and in theory be more productive than a group of humans not using GenAI: the humans could focus on the conceptual bottlenecks, and the AI could bang out the features that require only the translation of already-established architectural patterns.

When I write "conceptual bottlenecks" I don't mean standing in front of a whiteboard full of diagrams. What I mean is any work that gives proper meaning and functionality to the code: it can be at the level of an individual function, or of the project as a whole. It can also be outside the code itself, such as when you describe the desired behavior of (some part of) a program in TLA+.

For an example, see: https://medium.com/@polyglot_factotum/on-writing-with-ai-87c...

[go to top]