I think this kind of approach is interesting, but it's a bit sad that Cursor didn't discuss how they close the feedback loop: testing/verification. As generating code becomes cheaper, I think effort will shift to how we can more cheaply and reliably determine whether an arbitrary piece of code meets a desired specification. For example, did they use https://web-platform-tests.org/, fuzz testing (e.g. feeding in random webpages and informing the LLM when the fuzzer finds crashes), etc.? I would imagine that truly scaling long-running autonomous coding would put an emphasis on this.
Of course Cursor may well have done this, but it wasn't discussed in much depth in their blog post.
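To make the fuzzing idea concrete, a minimal cargo-fuzz harness could look something like this (fastrender::parse_and_render is a made-up entry point for illustration, not the project's actual API):

    #![no_main]
    use libfuzzer_sys::fuzz_target;

    // Feed arbitrary bytes in as a "webpage"; any panic while parsing or
    // rendering is reported by the fuzzer as a crash, and the minimized
    // input can be handed back to the agent as a reproducible failing case.
    fuzz_target!(|data: &[u8]| {
        if let Ok(html) = std::str::from_utf8(data) {
            let _ = fastrender::parse_and_render(html);
        }
    });

Crashes found this way, plus web-platform-tests pass rates, would give the agent an objective signal to optimize against, rather than its own judgement of whether the code works.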
I really enjoy reading your blog and it would be super cool to see you look at approaches people have to ensuring that LLM-produced code is reliable/correct.
To leverage AI to build a working browser you would imo need the following:
- A team of humans with some good ideas on how to improve on existing web engines.
- A clear architectural story written not by agents but by humans. Architecture does not mean high-level diagrams only. At each level of abstraction, you need humans to decide what makes sense and only use the agent to bang out slight variations.
- A modular and human-overseen agentic loop: one agent can keep running to try to fix a specific CSS feature (like grid), with a human expert reviewing the work at some interval (not sure how fine-grained it should be); a rough sketch of such a loop follows below. This is actually very similar to running an open-source project: you have code owners and a modular review process, not just an army of contributors committing whatever they want. And a "judge agent" is not the same thing as a human code owner acting as reviewer.
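For that last point, here is the rough shape such a loop could take. Every type and function below is a placeholder I invented for illustration, not an existing framework:

    // One agent iterates on a single module (say, CSS grid), but nothing
    // lands without a human code owner's sign-off at a fixed review interval.
    enum Review { Approve, RequestChanges(String), Stop }

    trait Agent {
        fn propose_patch(&mut self, failing: &[String]) -> String;
        fn observe_feedback(&mut self, notes: String);
    }

    trait TestSuite {
        fn failing_cases(&self) -> Vec<String>;
        fn run(&self, patch: &str) -> Vec<String>; // remaining failures
        fn merge(&mut self, patch: String);
    }

    fn improve_module(
        agent: &mut dyn Agent,
        tests: &mut dyn TestSuite,
        review_every: usize,
        human_review: impl Fn(&str, &[String]) -> Review,
    ) {
        for iteration in 1usize.. {
            let patch = agent.propose_patch(&tests.failing_cases());
            let failures = tests.run(&patch);
            if iteration % review_every != 0 {
                continue; // let the agent keep iterating on its own for a while
            }
            match human_review(patch.as_str(), failures.as_slice()) {
                Review::Approve => tests.merge(patch),
                Review::RequestChanges(notes) => agent.observe_feedback(notes),
                Review::Stop => break,
            }
        }
    }

The point is that the review interval and the merge gate are human decisions; the "judge" is not another model.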
Example of how not to do it: https://github.com/wilsonzlin/fastrender/blob/19bf1036105d4e...
This rendering loop architecture makes zero sense, and it does not implement web standards.
> in the HTML Standard, requestAnimationFrame is part of the frame rendering steps (“update the rendering”), which occur after running a task and performing a microtask checkpoint
> requestAnimationFrame callbacks run on the frame schedule, not as normal tasks.
This is BS: "update the rendering" is specified as just another task, which means it needs to be followed by a microtask checkpoint. See https://html.spec.whatwg.org/multipage/#event-loop-processin...
Following the spec doesn't mean you cannot optimize rendering tasks in some way vs other tasks in your implementation, but the above is not that; it's classic AI BS.
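For reference, the linked processing model has roughly this shape. This is a loose, Rust-flavored sketch with made-up names, not real engine code, and it simplifies the spec a lot; however you schedule the rendering update internally, it runs inside this one event loop and the script it runs is followed by a microtask checkpoint:

    struct Task;
    impl Task { fn run(&self) { /* run script, dispatch events, ... */ } }

    struct EventLoop { /* task queues, microtask queue, documents, ... */ }

    impl EventLoop {
        fn iteration(&mut self) {
            // 1. Pick the oldest runnable task from one of the task queues and run it.
            if let Some(task) = self.choose_task() {
                task.run();
            }
            // 2. Perform a microtask checkpoint.
            self.perform_microtask_checkpoint();
            // 3. Update the rendering, inside this same loop, only when there is
            //    a rendering opportunity. Running rAF callbacks is just running
            //    script, so it is followed by a microtask checkpoint too; it is
            //    not exempted from it by some separate "frame schedule".
            if self.has_rendering_opportunity() {
                self.run_animation_frame_callbacks();
                self.perform_microtask_checkpoint();
                self.update_style_layout_and_paint();
            }
        }

        // Stubs standing in for the real machinery.
        fn choose_task(&mut self) -> Option<Task> { None }
        fn perform_microtask_checkpoint(&mut self) {}
        fn has_rendering_opportunity(&self) -> bool { false }
        fn run_animation_frame_callbacks(&mut self) {}
        fn update_style_layout_and_paint(&mut self) {}
    }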
Understanding Web standards and translating them into an implementation requires human judgement.
Don't use an agent to draft your architecture; an expert in web standards with an interest in agentic coding is what is required.
Message to Cursor CEO: next time, instead of lighting those millions on fire, reach out to me first: https://github.com/gterzian
How much effort would it take for a group of humans to do it?
But in general, my guess at an answer (supported by the results of the experiment discussed in this thread) is that:
- GenAI left unsupervised cannot write a browser/engine, or any other complex software. What you end up with is just chaos.
- A group of humans using GenAI and supervising its output could write such an engine (or any other complex software), and in theory be more productive than a group of humans not using GenAI: the humans could focus on the conceptual bottlenecks, and the AI could bang out the features that require only the translation of already-established architectural patterns.
When I write "conceptual bottlenecks" I don't mean standing in front of a whiteboard full of diagrams. What I mean is any work that gives proper meaning and functionality to the code: it can be at the level of an individual function, or the project as a whole. It can also be outside of the code itself, such as when you describe the desired behavior of (some part of) a program in TLA+.
For an example, see: https://medium.com/@polyglot_factotum/on-writing-with-ai-87c...
“This is a clear indication that while the AI can write the code, it cannot design software”
To clarify what I mean by a product: if we want to design a browser system (engine + chrome) from scratch to optimize for human-computer symbiosis (Licklider), what would be the best approach? Who should take on the roles of making design decisions, implementation decisions, engineering decisions, and supervision?
We can imagine a whole system built with humans out of the loop; that would amount to a huge unit test and integration test with no real application.
Then humans can study it and learn from it.
Or the other way around: we have already made a huge mess of engineering beasts, and the machine will learn to fix our mess, or make it worse by an order of magnitude.
I don’t have an answer.
I used to be a big fan of TDD and now I am not; the testing system is a big mess by itself.
Thanks.
> what would be the best approach?
I don't know but it sounds like an interesting research topic.