1. esafak+(OP) 2026-02-04 22:59:46
I'm not as cynical as the others here; if there are no popular code review benchmarks, why shouldn't they design one?

Apparently this is in support of their 2.0 release: https://www.qodo.ai/blog/introducing-qodo-2-0-agentic-code-r...

> We believe that code review is not a narrow task; it encompasses many distinct responsibilities that happen at once. [...]

> Qodo 2.0 addresses this with a multi-agent expert review architecture. Instead of treating code review as a single, broad task, Qodo breaks it into focused responsibilities handled by specialized agents. Each agent is optimized for a specific type of analysis and operates with its own dedicated context, rather than competing for attention in a single pass. This allows Qodo to go deeper in each area without slowing reviews down.

> To keep feedback focused, Qodo includes a judge agent that evaluates findings across agents. The judge agent resolves conflicts, removes duplicates, and filters out low-signal results. Only issues that meet a high confidence and relevance threshold make it into the final review.

> Qodo’s agentic PR review extends context beyond the codebase by incorporating pull request history as a first-class signal.
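
For what it's worth, here's a minimal sketch of what that judge stage could look like. All of this is my own guess at the shape of it; the Finding fields, thresholds, and scoring are assumptions, not Qodo's actual design:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    agent: str         # which specialist produced this (e.g. "security", "perf")
    file: str
    line: int
    message: str
    confidence: float  # agent's own confidence, 0..1 (hypothetical field)
    relevance: float   # estimated relevance to the PR, 0..1 (hypothetical field)

def judge(findings: list[Finding],
          min_confidence: float = 0.8,
          min_relevance: float = 0.6) -> list[Finding]:
    """Merge findings from specialist agents: dedupe, resolve
    conflicts at the same location, and drop low-signal results."""
    by_location: dict[tuple[str, int], Finding] = {}
    for f in findings:
        key = (f.file, f.line)
        best = by_location.get(key)
        # Duplicate/conflict resolution: keep the higher-signal finding.
        if best is None or f.confidence * f.relevance > best.confidence * best.relevance:
            by_location[key] = f
    # Only issues over both thresholds make the final review.
    return [f for f in by_location.values()
            if f.confidence >= min_confidence and f.relevance >= min_relevance]

if __name__ == "__main__":
    findings = [
        Finding("security", "auth.py", 42, "token compared with ==", 0.95, 0.9),
        Finding("style", "auth.py", 42, "line too long", 0.9, 0.2),
        Finding("perf", "db.py", 7, "N+1 query in loop", 0.7, 0.9),
    ]
    for f in judge(findings):
        print(f"{f.file}:{f.line} [{f.agent}] {f.message}")
```

With these numbers only the security finding survives: the style finding loses the same-line conflict and the perf finding falls under the confidence threshold, which is the "only high confidence and relevance" behavior the post describes.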

replies(1): >>thierr+9b
2. thierr+9b 2026-02-05 00:10:18
>>esafak+(OP)
I'm building a benchmark for coding agent memory in the same spirit. There are plenty of memory tools out there, but I haven't been able to find a reliable benchmark for any of them, so I'm building one myself.
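
A minimal sketch of what one case in such a benchmark might look like, assuming a two-session store-then-recall format (the names and scoring are hypothetical, not from any existing tool):

```python
from dataclasses import dataclass

@dataclass
class MemoryCase:
    # Session 1: the agent sees this while working on a task.
    seed_context: str
    # Session 2 (fresh context): the agent is asked this later.
    probe: str
    # Strings that must appear in the answer for recall to count.
    expected: list[str]

def score(case: MemoryCase, answer: str) -> float:
    """Fraction of expected facts the agent recalled across sessions."""
    hits = sum(1 for fact in case.expected if fact.lower() in answer.lower())
    return hits / len(case.expected)

case = MemoryCase(
    seed_context="We pin Postgres to 14.9 because 15 breaks our migrations.",
    probe="Which Postgres version should the deploy script install, and why?",
    expected=["14.9", "migrations"],
)
print(score(case, "Install 14.9; 15 broke the migrations."))  # 1.0
```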

A lot of this stuff is really new, and we will need to find ways to standardize, but that will take time and consensus.

It took 4 years after the release of the automobile to coin the term mileage to refer to miles driven per unit of gasoline. In due time we will create the same kinds of metrics for AI.
