zlacker

[return to "Cursor's latest “browser experiment” implied success without evidence"]
1. paulus+0w[view] [source] 2026-01-16 17:04:21
>>embedd+(OP)
The blog [0] is worded rather conservatively, but on Twitter [2] the claim is pretty obvious and the hype effect is achieved [1][3].

The CEO stated "We built a browser with GPT-5.2 in Cursor"

instead of

"by dividing agents into planners and workers we managed to get them busy for weeks creating thousands of commits to the main branch, resolving merge conflicts along the way. The repo is 1M+ lines of code but the code does not work (yet)"

[0] https://cursor.com/blog/scaling-agents

[1] https://x.com/kimmonismus/status/2011776630440558799

[2] https://x.com/mntruell/status/2011562190286045552

[3] https://www.reddit.com/r/singularity/comments/1qd541a/ceo_of...

2. deng+sx[view] [source] 2026-01-16 17:10:33
>>paulus+0w
Even then, "resolving merge conflicts along the way" doesn't mean anything, as there are two trivial merge strategies that are always guaranteed to work ('ours' and 'theirs').
3. paulus+fA[view] [source] 2026-01-16 17:24:08
>>deng+sx
Haha. True, CI success was not part of the PR acceptance criteria at any point.

If you view the PRs, they bundle multiple fixes together, at least according to the commit messages. The next hurdle will be to guardrail agents so that they only implement one task and don't cheat by modifying the CI pipeline. Something like the pre-merge check sketched below.
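
What I have in mind is roughly this (a toy sketch; the protected paths and the way changed files are passed in are made up for illustration):

    import sys

    # Toy guardrail: fail the check if a PR touches CI config or workflow files.
    PROTECTED_PREFIXES = (".github/workflows/", "ci/", "Jenkinsfile")

    def violates_guardrail(changed_files: list[str]) -> list[str]:
        # Return the offending paths so the agent (or a human) sees why it failed.
        return [f for f in changed_files if f.startswith(PROTECTED_PREFIXES)]

    if __name__ == "__main__":
        offending = violates_guardrail(sys.argv[1:])
        if offending:
            print("PR modifies protected CI files:", ", ".join(offending))
            sys.exit(1)

Of course, an agent that is allowed to edit the guardrail script itself is back to square one.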

4. former+NB[view] [source] 2026-01-16 17:31:24
>>paulus+fA
If I had a nickel for every time I've seen a human dev disable/xfail/remove a failing test "because it's wrong" and then proceed to break production, I would have several nickels, which is not much, but it does suggest that deleting failing tests, like many behaviors, is not LLM-specific.
5. Tade0+VB2[view] [source] 2026-01-17 10:09:33
>>former+NB
If anything, the LLMs had to learn that from somewhere, so they're just copying human behaviour.
6. aspenm+Na3[view] [source] 2026-01-17 15:52:38
>>Tade0+VB2
I'm definitely in the camp that this browser implementation is shit, but just a reminder: agent training does involve human coding data in the early stages to bootstrap it, but in the reinforcement-learning phase it does not -- it learns closer to the way AlphaGo did, through self-play and verifiable rewards. This is why people are very bullish on agents: there is technically no limit to how well they can learn (unlike plain LLMs), and we know we will reach superhuman skill. The crucial reason is verifiable rewards: you have them for coding, but you do not have them for e.g. creative tasks.

So agents will actually be able to build a {browser, library, etc.} that won't be an absolute slopfest; the real question is when. You need better and more efficient RL training, further scaling (Amodei thinks scaling is really the only thing you technically need here, and that we have about 3-4 orders of magnitude of headroom left before hitting insurmountable limits), bigger context windows (that models actually handle well), and possibly continual-learning paradigms, but solutions to these problems look quite tangible now.
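
To make the "verifiable rewards" point concrete, the reward signal for coding can literally be a test run (a toy sketch, nothing like a real training pipeline; the repo path and the choice of pytest are my own assumptions):

    import subprocess

    def coding_reward(repo_path: str) -> float:
        # Verifiable reward: run the test suite and score the episode.
        # Real setups use richer signals (coverage, lint, per-test credit),
        # but the point is the same: pass/fail is checkable, taste is not.
        result = subprocess.run(
            ["python", "-m", "pytest", "-q"],
            cwd=repo_path,
            capture_output=True,
        )
        return 1.0 if result.returncode == 0 else 0.0

Creative work has no equivalent of "returncode == 0", which is the whole asymmetry.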
