1. kungfu+(OP) 2026-01-15 07:37:00
You raise a good point, which is that autonomous coding needs to be benchmarked on designs/challenges where the exact thing being built isn't part of the model's training set.
replies(1): >>Nitpic+48
2. Nitpic+48 2026-01-15 08:37:19
>>kungfu+(OP)
SWE-rebench does this. They gather real issues from GitHub repos on a ~monthly basis and test the models on them. On their leaderboard you can use a slider to select only issues created after a model was released, and see the stats for just those. It works well for open models; it's a bit more uncertain for closed models. Not perfect, but it's the best we have for this idea.
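
Roughly, the contamination filter is just a date comparison: keep only issues filed after the model's release. A made-up sketch of that idea (not SWE-rebench's actual code, data, or API; the dates and repo names are invented):

    from datetime import date

    # Invented issue records; SWE-rebench actually scrapes real issues from GitHub each month.
    issues = [
        {"repo": "org/alpha", "id": 101, "created": date(2025, 11, 3)},
        {"repo": "org/beta",  "id": 202, "created": date(2025, 12, 18)},
        {"repo": "org/gamma", "id": 303, "created": date(2026, 1, 9)},
    ]

    # Assumed release date of the model under test (what the leaderboard slider selects).
    model_release = date(2025, 12, 1)

    # Keep only issues filed after the release, so their fixes can't be in the training set.
    fresh = [i for i in issues if i["created"] > model_release]

    print(f"{len(fresh)} of {len(issues)} issues are post-release and safe to score on")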