zlacker

1. yorwba+(OP) 2026-02-03 17:04:50
SWE-Bench Pro (https://arxiv.org/abs/2509.16941) consists of 1865 tasks. Qwen3-Coder-Next solved 44.3% of them (826 or 827 tasks). Solving a single task took between ≈50 and ≈280 agent turns, ≈150 on average. In other words, a single pass through the dataset took ≈280000 agent turns (1865 × ≈150). Kimi-K2.5 solved ≈84 fewer tasks, but it also took only about a third as many agent turns.
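For anyone who wants to check the arithmetic, here is a rough sketch that reproduces the figures above from the quoted numbers alone. The variable and function names are made up for illustration, and the Kimi-K2.5 turn count is only an estimate derived from "about a third", not a reported value.

```python
# Figures quoted in the comment above; all names here are illustrative.
TOTAL_TASKS = 1865
SOLVED_PCT = 44.3   # reported solve rate, given to one decimal place
AVG_TURNS = 150     # approximate mean agent turns per task


def consistent_counts(total, pct, half_step=0.05):
    """Return every solved-task count whose percentage rounds to pct."""
    return [n for n in range(total + 1)
            if abs(100 * n / total - pct) < half_step]


solved = consistent_counts(TOTAL_TASKS, SOLVED_PCT)
total_turns = TOTAL_TASKS * AVG_TURNS
# Assumption: "about a third as many" is treated as exactly one third.
kimi_turns_estimate = total_turns / 3

print(solved)       # [826, 827]
print(total_turns)  # 279750, i.e. roughly 280000
print(round(kimi_turns_estimate))
```

The one-decimal percentage is why two task counts are possible: both 826/1865 and 827/1865 round to 44.3%.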
replies(2): >>regula+4c >>zamada+In
2. regula+4c 2026-02-03 17:53:49
>>yorwba+(OP)
If this is genuinely better than K2.5, even at a third the speed, then my OpenRouter credits are going to go unused.
3. zamada+In 2026-02-03 18:36:11
>>yorwba+(OP)
Ah, a spread of the individual tests makes plenty of sense! Many thanks (same goes to the other comments).