zlacker

A real-world benchmark for AI code review

submitted by benoco+(OP) on 2026-02-04 21:13:17 | 46 points 22 comments

NOTE: showing posts with links only
8. esafak+nk 2026-02-04 22:59:46
>>benoco+(OP)
I'm not as cynical as the others here; if there are no popular code review benchmarks, why shouldn't they design one?

Apparently this is in support of their 2.0 release: https://www.qodo.ai/blog/introducing-qodo-2-0-agentic-code-r...

> We believe that code review is not a narrow task; it encompasses many distinct responsibilities that happen at once. [...]

> Qodo 2.0 addresses this with a multi-agent expert review architecture. Instead of treating code review as a single, broad task, Qodo breaks it into focused responsibilities handled by specialized agents. Each agent is optimized for a specific type of analysis and operates with its own dedicated context, rather than competing for attention in a single pass. This allows Qodo to go deeper in each area without slowing reviews down.

> To keep feedback focused, Qodo includes a judge agent that evaluates findings across agents. The judge agent resolves conflicts, removes duplicates, and filters out low-signal results. Only issues that meet a high confidence and relevance threshold make it into the final review.

> Qodo’s agentic PR review extends context beyond the codebase by incorporating pull request history as a first-class signal.
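
In code terms, the judge step they describe might look something like this minimal sketch (Python; the Finding fields, the thresholds, and the dedup-by-location rule are all my assumptions, not Qodo's actual implementation):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Finding:
        """One issue reported by a specialized review agent."""
        agent: str         # e.g. "security", "performance", "style"
        file: str
        line: int
        message: str
        confidence: float  # agent's own confidence in [0, 1]
        relevance: float   # estimated relevance to this PR in [0, 1]

    # Hypothetical thresholds; the real values are not public.
    CONFIDENCE_MIN = 0.8
    RELEVANCE_MIN = 0.6

    def judge(findings: list[Finding]) -> list[Finding]:
        """Merge duplicates across agents and keep only high-signal findings."""
        # Deduplicate: findings from different agents that flag the same
        # location are merged, keeping the highest-confidence report.
        best: dict[tuple[str, int], Finding] = {}
        for f in findings:
            key = (f.file, f.line)
            if key not in best or f.confidence > best[key].confidence:
                best[key] = f
        # Filter: only findings above both thresholds reach the final review.
        return [
            f for f in best.values()
            if f.confidence >= CONFIDENCE_MIN and f.relevance >= RELEVANCE_MIN
        ]

A real system would presumably do fuzzier duplicate detection than exact file/line matching, but the shape is the same: fan out to specialist agents, then gate everything through one filter.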

9. logicx+ol 2026-02-04 23:05:14
>>benoco+(OP)
Where's the code for this? I'd love to run our tool, https://tachyon.so/, against it.

14. zhuber+Ty 2026-02-05 00:33:58
>>benoco+(OP)
I'm taking a slightly different approach to pricing with ShipItAI (https://shipitai.dev, brazen plug): $5/mo per active dev, plus a Bring Your Own Key option for those who want tighter price controls.

It's still early in development and has a much simpler goal, but I like simple things that work well.
