Cursor's latest “browser experiment” implied success without evidence

>>embedd+(OP)
The blog[0] is worded rather conservatively but on Twitter [2] the claim is pretty obvious and the hype effect is achieved [2]

CEO stated "We built a browser with GPT-5.2 in Cursor"

instead of

"by dividing agents into planners and workers we managed to get them busy for weeks creating thousands of commits to the main branch, resolving merge conflicts along the way. The repo is 1M+ lines of code but the code does not work (yet)"

[0] https://cursor.com/blog/scaling-agents

[1] https://x.com/kimmonismus/status/2011776630440558799

[2] https://x.com/mntruell/status/2011562190286045552

[3]https://www.reddit.com/r/singularity/comments/1qd541a/ceo_of...

>>paulus+0w
Even then, "resolving merge conflicts along the way" doesn't mean anything, as there are two trivial merge strategies that are always guaranteed to work ('ours' and 'theirs').

>>deng+sx
Haha. True, CI success was not part of PR accept criteria at any point.

If you view the PRs, they bundle multiple fixes together, at least according to the commit messages. The next hurdle will be to guardrail agents so that they only implement one task and don't cheat by modifying the CI piepeline

>>paulus+fA
If I had a nickel for every time I've seen a human dev disable/xfail/remove a failing test "because it's wrong" and then proceeding to break production I would have several nickels, which is not much, but does suggest that deleting failing tests, like many behaviors, is not LLM-specific.

>>former+NB
> but does suggest that deleting failing tests, like many behaviors, is not LLM-specific.

True, but it is shocking how often claude suggests just disabling or removing tests.

>>vizzie+TX
> it is shocking how often claude suggests just disabling or removing tests.

Arguably, Claude is simply successfully channeling what the developers who wrote the bulk of its training data would do. We've already seen how bad behavior injected into LLMs in one domain causes bad behavior in other domains, so I don't find this particularly shocking.

The next frontier in LLMs has to be distinguishing good training data from bad training data. The companies have to do this, even if only in self defense against the new onslaught of AI-generated slop, and against deliberate LLM poisoning.

If the models become better at critically distinguishing good from bad inputs, particularly if they can learn to treat bad inputs as examples of what not to do, I would expect one benefit of this is that the increased ability of the models to write working code will then greatly increase the willingness of the models to do so, rather than to simply disable failing tests.

zlacker