zlacker

[return to "Scaling long-running autonomous coding"]
1. embedd+Ya[view] [source] 2026-01-14 23:02:00
>>samwil+(OP)
Did anyone manage to run the tests from the repository itself? The code seems filled with errors and warnings, and as far as I can tell none of them are because of the platform I'm on (Linux). I went and looked through a few pages of the Actions workflow history, and it seems CI has been failing for a while; PRs have also been failing CI but getting merged anyway. How exactly was this verified to be something that can be held up as a successful example, or am I misunderstanding the point they are trying to make? They mention a screenshot, but they never actually say whether their goal was successfully met, do they?

I'm not sure "completely autonomous coding" is the right approach. I feel like we'll get more out of these tools if we treat them as something a human uses to accomplish a task, and lean into letting the human drive, because otherwise quality spirals out of control so quickly.

◧◩
2. snek_c+lO[view] [source] 2026-01-15 03:29:48
>>embedd+Ya
I found the codebase very hard to navigate. Hundreds (over a thousand?) of tiny files, each fewer than 200 lines, in deeply nested subdirectories. I wanted to find where the JavaScript engine was and where the DOM implementation lived, and I couldn't easily locate either, even using the GitHub search feature. I'm still not sure what this browser actually implements, or how.

Even their README is kind of crappy. Ideally you want installation instructions right near the top, but instead it's split across multiple files. The README link labeled "running + architecture" (pointing at a file that's actually called browser_ui.md???) is hard to follow. There's no explicit list of dependencies, and again no real explanation of how JavaScript execution works, or how rendering works.

It's impressive that they got such a big project built by agents and compiling at all, but this codebase... feels like AI slop, and you couldn't pay me to maintain it. You could try to get AI agents to maintain it, but my prediction is that past some scale they would have a hard time figuring out their own mess. You'd just be left with permanent bugs you can't easily fix.

◧◩◪
3. boness+ib1[view] [source] 2026-01-15 07:04:38
>>snek_c+lO
So the chain of events here is: copy existing tutorials and publicly available code, train the model to spit it back out-ish when asked, hand it a mature-ish specification, and now they jitter and jumble toward a facsimile of a junior copy-paste outsourcing nightmare they can’t maintain (creating exciting liabilities for all parties involved).

I can’t shake the feeling that simply being shameless about copy-paste (i.e. copyright infringement) would let existing tools do much the same thing faster and more efficiently. Download Chromium, search-replace ‘Google’ with ‘ME!’, run Make… if I put that in a small app, someone would point out it’s really just a bash one-liner.

There’s a lot of utility in better search and natural language interactions. But the siren call of feedback loops plays with our sense of time and might be clouding our sense of progress and utility.

◧◩◪◨
4. kungfu+Le1[view] [source] 2026-01-15 07:37:00
>>boness+ib1
You raise a good point, which is that autonomous coding needs to be benchmarked on designs/challenges where the exact thing being built isn't part of the model's training set.
◧◩◪◨⬒
5. Nitpic+Pm1[view] [source] 2026-01-15 08:37:19
>>kungfu+Le1
swe-REbench does this. They gather real issues from GitHub repos on a roughly monthly basis and test the models on them. On their leaderboard you can use a slider to select only issues created after a given model was released and see the stats. It works well for open models; I'm less certain about closed ones. Not perfect, but it's the best we have for this idea. Roughly the idea, as a sketch (the dates, field names, and scoring below are made up for illustration, not swe-REbench's actual schema):

    from datetime import date

    # Illustration only: hypothetical release dates and issue fields.
    MODEL_RELEASE = {"model-a": date(2025, 6, 1)}

    def post_release_issues(issues, model):
        """Keep only issues opened after the model's release date,
        so the fix can't have been in its training data."""
        cutoff = MODEL_RELEASE[model]
        return [i for i in issues if i["created_at"] > cutoff]

    def resolved_rate(results, issues, model):
        """Fraction of the post-release subset the model resolved.
        results maps issue id -> bool (resolved or not)."""
        subset = post_release_issues(issues, model)
        if not subset:
            return None
        return sum(results[i["id"]] for i in subset) / len(subset)
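The point of the slider is just that date filter: scores computed only over issues the model could not have seen during training.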