If you view the PRs, they bundle multiple fixes together, at least according to the commit messages. The next hurdle will be to guardrail agents so that they only implement one task and don't cheat by modifying the CI piepeline
True, but it is shocking how often claude suggests just disabling or removing tests.
Arguably, Claude is simply successfully channeling what the developers who wrote the bulk of its training data would do. We've already seen how bad behavior injected into LLMs in one domain causes bad behavior in other domains, so I don't find this particularly shocking.
The next frontier in LLMs has to be distinguishing good training data from bad training data. The companies have to do this, even if only in self defense against the new onslaught of AI-generated slop, and against deliberate LLM poisoning.
If the models become better at critically distinguishing good from bad inputs, particularly if they can learn to treat bad inputs as examples of what not to do, I would expect one benefit of this is that the increased ability of the models to write working code will then greatly increase the willingness of the models to do so, rather than to simply disable failing tests.
>"To test this system, we pointed it at an ambitious goal: building a web browser from scratch."
and then near the end, they say:
>"Hundreds of agents can work together on a single codebase for weeks, making real progress on ambitious projects."
This means they only make progress toward it, but do not "build a web browser from scratch".
If you're curious, the State of Utopia (will be available at https://stateofutopia.com ) did build a web browser from scratch, though it used several packages for the networking portion of it.
See my other comments and posts for links.
There are a lot of really bad human developers out there, too.
"Fix the tests." This was interpreted literally, and assert status == 200 got changed to assert status == 500 in several locations. Some tests required more complex edits to make them "pass."
Inquiries about the tests went unanswered. Eventually the 2000 lines of slop was closed without merging.
So you flubbed managing a project and are now blaming your employees. Classy.
Latest example is when I recently vibe coded a little Python MQTT client for a UPS connected to a spare Raspberry Pi to use with Home Assistant, and with a just few turns back and forth I got this extremely cool bespoke tool and felt really fun.
So I spent a while customizing how the data displayed on my Home Assistant dashboard and noticed every single data point was unchanging. It took a while to realize because the available data points wouldn’t be expected to change a whole lot on a fully charged UPS but the voltage and current staying at the exact same value to a decimal place for three hours raised my suspicions.
After reading the code I discovered it had just used one of the sample command line outputs from the UPS tool I gave it to write the CLI parsing logic. When an exception occurred in the parser function it instead returned the sample data so the MQTT portion of the script could still “work”.
Tbf Claude did eventually get it over the finish line once I clarified that yes, using real data from the actual UPS was in fact an important requirement for me in a real time UPS monitoring dashboard…
If LLMs do this it should be seen as an issue and should not be overlooked with “people do it too…”. Professional developers do not do this. If we’re going to use Ai for creating production code we need to be honest about its deficiencies.
It's similar to early versions of autonomous driving. You's not want to sit in the back seat with nobody at the wheel. That would get you killed guaranteed.
Tesla owner keeps using Autopilot from backseat—even after being arrested:
https://mashable.com/article/tesla-autopilot-arrest-driving-...
http://www.mickdarling.com/2019/07/26/busy-summer/
An embedded page at landr-atlas.com says:
Attention!
MacOS Security Center has identified that your system is under threat.
Please scan your MacOS as soon as possible to avoid more damage.
Don't leave this page until you have undertaken all the suggested steps
by authorised Antivirus.
[OK]So agents will actually be able to build a {browser, library, etc} that won't be an absolute slopfest, but the real crucial question is when. You need better and more efficient RL training, further scaling (Amodei thinks really scaling is the only thing you technically need here and we have about 3-4 orders of magnitude of headroom left before we hit insurmountable limits), bigger context windows (that models actually handle well) and possibly continual learning paradigms, but solutions to these problems are quite tangible now.
Testing, specifically, is heavily opinionated among professional developers.
Whether you had anything to do with it or not, I have no idea. And, since you didn't follow best practices and tell me directly rather than trying to score points here, there's really no way of knowing whether you're the one who caused the problem in the first place.
I built a new site without Wordpress. That took in less than a day.
I don't imagine you will alter your behavior to align with general best security practices anytime soon.
Are you actually accusing me (slyly couched in weasel words, but still explicitly) of hacking your wordpress blog, then pointing it out on Hacker News to score points?
Yeah, you have a point /s: there's really no way to tell if I hacked your blog or not, nor any way of knowing whether any statement is true or not if you're nihilistic enough, but you're going to have to take my word that I didn't, and clean up your own mess without shifting the blame to me, or demanding I should have helped you. You're the one who chose to use wordpress, not me. FYI, "general best security practices" include DON'T USE WORDPRESS.
What possible evidence or delusional reasons do you have to imply that I hacked your wordpress blog? Is your security really that lax and password that easy to guess? And even if I did, then why would I post about it publicly or notify you privately? You sound pathologically paranoid and antisocially aggressive to make such baseless accusations out of the blue, to try to shift the blame to me for your own mistakes. That makes me glad I didn't try to contact you directly. Funny thing for you to complain about when you don't even openly publish your contact email address on your blog or hn profile like I do, though.