zlacker

1. zamada+(OP) 2026-02-03 16:35:13
Much appreciated, but I meant something more like "what do the error bars in the figure represent?" than what the turn scaling itself is.
replies(2): >>esafak+s7 >>jsnell+A7
2. esafak+s7 2026-02-03 17:03:31
>>zamada+(OP)
For the tasks in SWE-Bench Pro they obtained a distribution of agent turns, summarized as the box plot. The box likely describes the interquartile range, while the whiskers describe some other range. You'd have to read their report to be sure. https://en.wikipedia.org/wiki/Box_plot
3. jsnell+A7 2026-02-03 17:03:55
>>zamada+(OP)
That's a box plot, so those are not error bars but a visualization of the distribution of a metric (min, max, median, 25th percentile, 75th percentile).

The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
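
To make that concrete, here is a minimal sketch of how the five numbers behind a box plot could be computed from per-task turn counts. The data and the `five_number_summary` helper are made up for illustration and are not taken from the benchmark or its report. It also assumes the whiskers mark the min and max; some plots draw them at 1.5x the interquartile range instead, so the figure in question may use a different convention.

```python
import statistics


def five_number_summary(values):
    """Return (min, q1, median, q3, max) for a list of at least two numbers."""
    ordered = sorted(values)
    # n=4 yields the three quartile cut points; "inclusive" treats the
    # data as the whole population rather than a sample of a larger one.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Hypothetical turn counts for nine tasks (illustrative only).
turns = [12, 15, 18, 20, 25, 30, 34, 41, 60]
summary = five_number_summary(turns)
print(summary)
```

The box would span the second and fourth values (25th to 75th percentile), the line inside the box is the median, and under the min/max assumption the whiskers reach the first and last values.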
